The Mouse Functional Genome Database (MfunGD): functional annotation of proteins in the light of their cellular context.

Andreas Ruepp, Octave Noubibou Doudieu, Jos van den Oever, Barbara Brauner, Irmtraud Dunger-Kaltenbach, Gisela Fobo, Goar Frishman, Corinna Montrone, Christine Skornia, Steffi Wanka, Thomas Rattei, Philipp Pagel, Louise Riley, Dmitrij Frishman, Dimitrij Surmeli, Igor V Tetko, Matthias Oesterheld, Volker Stümpflen, H Werner Mewes.

Abstract

MfunGD (http://mips.gsf.de/genre/proj/mfungd/) provides a resource for annotated mouse proteins and their occurrence in protein networks. Manual annotation concentrates on proteins which are found to interact physically with other proteins. Accordingly, manually curated information from a protein-protein interaction database (MPPI) and a database of mammalian protein complexes is interconnected with MfunGD. Protein function annotation is performed using the Functional Catalogue (FunCat) annotation scheme, which is widely used for the analysis of protein networks. The dataset is also supplemented with information about the literature that was used in the annotation process as well as links to the SIMAP Fasta database, the Pedant protein analysis system and cross-references to external resources. Proteins that have not yet been manually inspected are annotated automatically by a graphical probabilistic model and/or superparamagnetic clustering. The database is continuously expanding to include the rapidly growing amount of functional information about gene products from mouse. MfunGD is implemented in GenRE, a J2EE-based component-oriented multi-tier architecture following the separation-of-concerns principle.

Substances: Multiprotein Complexes

Year: 2006 PMID: 16381934 PMCID: PMC1347437 DOI: 10.1093/nar/gkj074

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The Mouse Functional Genome Database (MfunGD) aims to provide a high-quality information resource for the research community, incorporating manual annotation of gene products, in particular with respect to their cellular function in the context of their interactions. Mus musculus is one of the most thoroughly studied mammalian model organisms. For thousands of mouse proteins, functional properties have been predicted or experimentally investigated, and part of this information is stored in databases like UniProt and MGI (1,2). Owing to its exceptional importance as a model organism, the mouse genome was the second mammalian genome to be sequenced (3). Mouse is genetically tractable, and large collections of mouse mutants exist which yield invaluable insights into the function of mammalian genes (4). Unfortunately, determining the genotype of mouse mutants obtained by treatment with chemical compounds such as ENU is extremely time-consuming and labour-intensive. In order to understand the function of mammalian genes in context and to identify the causes of complex diseases having a genetic background in mammals, bridging the gap between genotype and phenotype will be one of the most important and challenging tasks for the future. To achieve this goal, the knowledge about the function of isolated proteins needs to be extended to their functional context in the cellular environment. Such an endeavour requires the integration of different sources of information like protein–protein interactions, genetic interactions as well as co-expression data. The integration of these data results in distinct but interconnected networks of proteins responsible for defined functional tasks in cells, so-called functional modules (5). However, so far no reliable data set of functional modules for a mammalian organism exists. As an important step towards this goal, we apply a combination of computational methods and manual annotation to the mouse proteome, with strong emphasis on the cellular context.

SYSTEM ARCHITECTURE

A comprehensive genome resource must not only be capable of storing and displaying information on gene products but must also support manual and semi-automatic annotation. To fulfil these requirements, we implemented MfunGD within the MIPS Genome Research Environment (GenRE). This allows seamless integration of database management systems as well as the various components required for a flexible annotation pipeline. GenRE is a J2EE-based component-oriented multi-tier architecture that hides the complexity of the procedures from the user. For example, the manual annotation process requires not only access to various data sources but also the structured integration of different algorithms, such as clustering of protein family members. These databases and applications are typically distributed across physically separated computing resources. We developed an integration tier capable of levelling the differences between the underlying resources by converting them into so-called data access objects (DAOs). The main advantage of the DAO design pattern within MfunGD is uniform access to any resource at the Java object level. For databases, we used DAOs based on Hibernate, a high-performance object/relational persistence and query service, whereas for applications the DAOs were explicitly designed. On top of the integration tier, we implemented a so-called business tier based on Enterprise Java Beans (EJBs). EJBs are the core components for any kind of application (business) logic related to complex information processing within the annotation pipeline and advanced queries. For further unification of information, the EJB components accept and deliver results in XML format. The XML format is used not only in the completely separated web tier for rendering HTML output with XSL style sheets (see Figure 1), but also for communication with rich clients for manual annotation. Transmitting a single comprehensive XML document avoids time-consuming multiple invocations of EJB methods.
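The idea behind the integration tier can be illustrated with a minimal sketch of the DAO pattern. It is written in Python for brevity, whereas GenRE itself is Java-based, and it is not the GenRE implementation: the names `ProteinRecord`, `ProteinDAO`, `TableDAO`, `ToolDAO` and `collect`, as well as the example data, are hypothetical. The sketch only shows how a database-like resource and an application can be wrapped so that higher tiers access both through the same interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Protocol


@dataclass(frozen=True)
class ProteinRecord:
    """Uniform object handed to the higher tiers (hypothetical)."""
    protein_id: str
    source: str
    data: dict


class ProteinDAO(Protocol):
    """The common interface every resource wrapper has to offer."""
    def find(self, protein_id: str) -> Optional[ProteinRecord]: ...


class TableDAO:
    """Wraps a database-like resource; a dict stands in for a table here."""
    def __init__(self, name: str, rows: Dict[str, dict]) -> None:
        self._name = name
        self._rows = rows

    def find(self, protein_id: str) -> Optional[ProteinRecord]:
        row = self._rows.get(protein_id)
        if row is None:
            return None
        return ProteinRecord(protein_id, self._name, dict(row))


class ToolDAO:
    """Wraps an application; a plain callable stands in for the tool."""
    def __init__(self, name: str, tool: Callable[[str], Optional[dict]]) -> None:
        self._name = name
        self._tool = tool

    def find(self, protein_id: str) -> Optional[ProteinRecord]:
        result = self._tool(protein_id)
        if result is None:
            return None
        return ProteinRecord(protein_id, self._name, result)


def collect(protein_id: str, daos: List[ProteinDAO]) -> List[ProteinRecord]:
    """Business-tier style helper: query all resources the same way."""
    records = []
    for dao in daos:
        record = dao.find(protein_id)
        if record is not None:
            records.append(record)
    return records


# Illustrative data only; identifiers and values are made up.
table_dao = TableDAO("table", {"P1": {"name": "alpha enolase"}})
tool_dao = ToolDAO("tool", lambda pid: {"cluster": 7} if pid == "P1" else None)
records = collect("P1", [table_dao, tool_dao])
```

Because `collect` depends only on the `find` method, a further resource can be added by writing one more wrapper class, without touching the calling code.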

Figure 1

Screenshot of the MfunGD entry for the enzyme alpha enolase.

A further advantage of the component-oriented approach is that the system can be extended with minimal effort. For example, MfunGD has been extended with a configurable advanced query interface component that is also used by other resources within MIPS. This interface makes it possible to query the database using logical combinations of terms, in a similar way to the Entrez service. Customizable full-text searches across the database are possible without any knowledge of the underlying data structure. Indexed information is queried with simple expressions that allow wildcards and combination with logical operators. For example, a search for all mitochondrial proteins (functional category 70.16) with >1000 amino acids is performed by the expression ‘70.16*[FCC] >1000[PIL]’ instead of a complicated native database query involving several table joins.
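How such an expression might be evaluated can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MfunGD query engine: it assumes that whitespace-separated terms are combined with a logical AND, that `FCC` refers to FunCat category codes and `PIL` to protein length in amino acids, and that records are plain dictionaries with the hypothetical keys `funcat` and `length`. All function names and example records are made up.

```python
import re
from fnmatch import fnmatchcase
from typing import Dict, List, Tuple

# One term looks like VALUE[FIELD], e.g. 70.16*[FCC] or >1000[PIL].
TERM = re.compile(r"(?P<value>[^\s\[\]]+)\[(?P<field>[A-Za-z]+)\]")


def parse_query(expression: str) -> List[Tuple[str, str]]:
    """Split an expression into (field, value) pairs."""
    return [(m["field"], m["value"]) for m in TERM.finditer(expression)]


def term_matches(record: Dict, field: str, value: str) -> bool:
    """Evaluate a single term against one record."""
    if field == "FCC":
        # Wildcard match against any of the record's category codes.
        return any(fnmatchcase(code, value) for code in record["funcat"])
    if field == "PIL":
        comparison = re.fullmatch(r"([<>])(\d+)", value)
        if comparison is None:
            raise ValueError(f"unsupported length term: {value}")
        limit = int(comparison.group(2))
        if comparison.group(1) == ">":
            return record["length"] > limit
        return record["length"] < limit
    raise ValueError(f"unknown field: {field}")


def search(records: List[Dict], expression: str) -> List[str]:
    """Return the ids of records that satisfy every term (implicit AND)."""
    terms = parse_query(expression)
    return [
        record["id"]
        for record in records
        if all(term_matches(record, field, value) for field, value in terms)
    ]


# Made-up example records.
proteins = [
    {"id": "p1", "funcat": ["70.16"], "length": 1500},
    {"id": "p2", "funcat": ["70.16.01", "01.05"], "length": 800},
    {"id": "p3", "funcat": ["70.10"], "length": 2000},
    {"id": "p4", "funcat": ["01.05", "70.16.03"], "length": 1001},
]
hits = search(proteins, "70.16*[FCC] >1000[PIL]")
```

Note that a plain wildcard such as `70.16*` would also match a sibling code like `70.160` if one existed; a production parser would need to respect the level boundaries of the hierarchy.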

DATA CONTENT

An inherent problem in the analysis of mammalian genomes is the lack of a complete and stable set of all extant full-length transcripts. New transcripts and splice variants are published regularly, requiring continuous updating of datasets such as RefSeq. Compared to the September 2004 RefSeq mouse release, the dataset of March 2005 contained 1331 new entries and 542 entries with changes in the transcript sequence, whereas 196 entries had been removed. MfunGD will also allow updates of genome assembly, transcript data and gene models. The basis of the MfunGD dataset is a complement of gene products that was obtained by Softberry Inc., which used the FGENESH++C software as gene predictor. This procedure resulted in 42 049 gene products for mouse. Those include 13 259 gene products which were identical or highly similar to RefSeq cDNAs, 18 330 gene models with significant similarity to a non-redundant (NR) database and 10 460 gene models without significant similarity (>90% identity) to the NR database. Transcripts of this dataset are currently mapped to the curated RefSeq dataset from the mouse strain C57BL/6J by a mapping procedure based on the Blat software (6). Known transcripts from external resources which are not yet present in our dataset are added. The Softberry gene models were based on the Build 30 assembly of the mouse genome (mm3, Feb. 2003). These models were mapped to the May 2004 mouse genome assembly. The UCSC Genome Browser (7) allows visualization of the MfunGD transcripts, gene models and RefSeq data.

ANNOTATION

MfunGD is a resource for manually and automatically annotated proteins and genes from mouse. General protein and gene features like InterPro domains, 3D structure and physical properties are precalculated by the Pedant system (8). InterPro domains and predicted transmembrane domains are shown on the MfunGD web page; other features can be accessed via hyperlinks. Results from Fasta sequence similarity searches against >3 000 000 protein sequences can be retrieved from the SIMAP database (9). Attributes like gene names, protein names and synonyms are retrieved from public resources like UniProt (1), MGD (2) or RefSeq (10). In addition, MfunGD contains information about the literature that was used for manual annotation as well as protein ID, FunCat annotation, comments, update information and cross-references to RefSeq, UniProt and MGD.

Manual annotation

A central part of the annotation process is the assignment of functional categories to protein entries. At MIPS, the Functional Catalogue (FunCat) is used for function annotation. This annotation scheme has been applied to the manual annotation of several model organisms (11). FunCat is a hierarchically structured, organism-independent, flexible, controlled and scalable classification system enabling the functional description of proteins from any organism (11). The capabilities of FunCat are not limited to the functional annotation of genomes; it also provides a powerful tool for analysing genome- and proteome-wide data generated by large-scale transcriptome/proteome experiments (12–14) and for the computational analysis of functional networks (15,16). This versatility makes FunCat a powerful and intensively used tool for the integration of protein function data from different sources, and it thus fulfils the needs of bioinformatics approaches in systems biology. The assignment of functional categories in the manual annotation process depends primarily on the experimental evidence given in the literature. Here, the hierarchical structure of FunCat allows the specificity of the assigned functional category to be adjusted to the information content of the experiments. In addition to information from the literature, data from other sources like InterPro (17) and FunCatDB (11) as well as external resources like SwissProt (1), GenBank (18) and MGI (2) are used in order to obtain a comprehensive overview of the cellular function of the respective proteins. Evaluation of the information and the resulting assignment of functional categories are the responsibility of trained annotators. So far, ∼4000 mouse proteins have been manually curated. Experimentally investigated proteins are on average associated with 4.6 FunCat categories. Information that exceeds the specificity of FunCat categories is stored as E.C. numbers or is presented in comment fields.
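The way a hierarchical code lets an annotator pick a category depth that matches the available evidence can be sketched as follows. The sketch assumes that FunCat codes are dot-separated level numbers, as in the category 70.16 mentioned in this article; the function names and the deeper example code 70.16.01 are hypothetical and serve only as an illustration.

```python
from typing import List


def ancestors(code: str) -> List[str]:
    """Return the code together with all its parents, most general first."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def generalize(code: str, max_level: int) -> str:
    """Truncate a code to at most max_level levels of the hierarchy."""
    if max_level < 1:
        raise ValueError("max_level must be at least 1")
    return ".".join(code.split(".")[:max_level])


def is_within(code: str, category: str) -> bool:
    """True if code equals category or lies below it in the hierarchy."""
    return category in ancestors(code)


# Hypothetical three-level code used purely as an example.
lineage = ancestors("70.16.01")
```

If the experimental evidence supports only a general statement, an annotator would assign the shorter code, for example `generalize("70.16.01", 2)`, which yields the two-level category.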

FunCat annotation using hRMN/gSPC

The high number of gene products in mammals requires supporting manual annotation by automated prediction of protein functions. The relation between sequence similarity and functional conservation has been well established for protein domains and complete proteins. Since transfer of functional annotation given high sequence conservation is reliable, MfunGD data sources for human, mouse and rat as well as other mammalian proteins annotated in SwissProt were used. For the human genome, manual annotation was obtained from Biomax Informatics AG. Using conservative thresholds, FunCat information has been transferred to 13 193 mouse proteins. If any of these protein entries is subsequently subjected to manual inspection, information of the automated process is supplemented with literature information and modified, if necessary. Available in-house manual annotation was complemented by an automated mapping of available manual GO-annotations to FunCat categories. Based on sequence similarity data and InterPro domains, further sequence-associated information was compiled. Automated annotation support is provided by two different systems, gSPC and hRMN. gSPC stands for ‘global SuperParamagnetic Clustering’ (19), in which sequences are clustered in a Monte Carlo process according to a sequence similarity score. Functional annotation is then transferred within a cluster from known to unclassified proteins by a consensus process among the known proteins in the same cluster. An internal parameter of the process determines the granularity, specificity and coverage of clusters. SPC has been further developed into globalSPC by systematic variation of the parameter settings (19). The hRMN method (heterogeneous Relational Markov Network) generates confidence values for the assignment of functional classification. hRMN is based on a network graph able to employ any parameter that can be assigned to a pair of sequences, such as sequence similarity, InterPro domains or quantitative data such as correlation of transcript regulation. Independent graphs formed by independent data sources are connected to form a Markov network, taking advantage of the synergistic effects between them (20). In the graph, nodes represent proteins and the edges are weighted according to the strength of the relation between adjacent nodes. Nodes may have FunCat labels as attributes whose propagation from known to unknown proteins is assessed utilizing Belief Propagation (19)/Generalized BP (20). Note that Belief Propagation does not require the sources to be uncorrelated or to have similar distributions. Moreover, BP allows us to simultaneously calculate the marginal beliefs for all (not only one) attributes, and resolves conflicts incurred by the inherent property of FunCat to allow more than one functional label for each protein, which confronts classifiers with a non-standard, soft classification task.
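The propagation of a label over a weighted protein graph can be illustrated with a strongly simplified sketch of sum-product belief propagation. It is not the hRMN method: it handles a single binary label instead of the full FunCat label set, uses one graph instead of several connected data sources, and does not implement Generalized BP. The pairwise potential (a factor of exp(weight) when two neighbours agree, 1 otherwise), the function name `belief_propagation` and the example graph are assumptions made for this illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple


def belief_propagation(
    edges: Dict[Tuple[str, str], float],
    priors: Dict[str, float],
    n_iter: int = 20,
) -> Dict[str, float]:
    """Return P(label = 1) for every node of an undirected weighted graph.

    edges  maps (u, v) to a non-negative relation strength.
    priors maps annotated nodes to P(label = 1); others default to 0.5.
    """
    neighbours = defaultdict(dict)
    for (u, v), weight in edges.items():
        neighbours[u][v] = weight
        neighbours[v][u] = weight
    nodes = list(neighbours)

    # Unary potentials as (value for label 0, value for label 1).
    phi = {n: (1.0 - priors.get(n, 0.5), priors.get(n, 0.5)) for n in nodes}
    # Messages from u to v, initialised to be uninformative.
    msgs = {(u, v): (0.5, 0.5) for u in nodes for v in neighbours[u]}

    for _ in range(n_iter):
        updated = {}
        for (u, v) in msgs:
            agree = math.exp(neighbours[u][v])  # assumed pairwise potential
            # Combine u's own evidence with messages from all other neighbours.
            p0, p1 = phi[u]
            for k in neighbours[u]:
                if k != v:
                    p0 *= msgs[(k, u)][0]
                    p1 *= msgs[(k, u)][1]
            m0 = p0 * agree + p1
            m1 = p0 + p1 * agree
            total = m0 + m1
            updated[(u, v)] = (m0 / total, m1 / total)
        msgs = updated  # synchronous update of all messages

    beliefs = {}
    for n in nodes:
        b0, b1 = phi[n]
        for k in neighbours[n]:
            b0 *= msgs[(k, n)][0]
            b1 *= msgs[(k, n)][1]
        beliefs[n] = b1 / (b0 + b1)
    return beliefs


# Made-up chain A - B - C: A is annotated, B and C are not.
beliefs = belief_propagation(
    edges={("A", "B"): 2.0, ("B", "C"): 0.5},
    priors={"A": 0.95},
)
```

In this toy chain, the unannotated protein B is strongly linked to the annotated protein A and therefore receives a higher belief than C, which is only weakly linked to B. On graphs with cycles the same update is only approximate, which is one motivation for generalized variants.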

INVESTIGATION OF PROTEINS IN THEIR CELLULAR CONTEXT

While the primary goal of any sequencing effort is the identification of the genetic elements of an organism, the ultimate perspective in the functional analysis is a better understanding of the molecular function to uncover the molecular cause of human diseases. With the first mammalian genomes at hand, it becomes obvious that a gene-centric view is fundamentally insufficient to understand complex cellular processes such as signal transduction, gene regulation or cell differentiation. An understanding of life processes requires the integration of genome as well as transcriptome, metabolome and other -omics sciences. Any quantitative model of cellular networks must combine different types of information. The integration of our Mammalian Protein–Protein Interaction Database (21) and the Mammalian Protein Complex database (22) with public protein–protein interaction data, gene expression data and text mining results will form the basis for the compilation of functional modules from mouse. This data set will be manually curated in order to serve as a reference data set for functional modules for a mammalian model organism and moreover provide a useful resource on the way to close the gap between genotype and phenotype.

20 in total

1. From molecular to modular cell biology.

Authors: L H Hartwell; J J Hopfield; S Leibler; A W Murray
Journal: Nature Date: 1999-12-02 Impact factor: 49.962

2. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

3. How well do we understand the clusters found in microarray data?

Authors: Amanda Clare; Ross D King
Journal: In Silico Biol Date: 2002

4. Initial sequencing and comparative analysis of the mouse genome.

Authors: Robert H Waterston; Kerstin Lindblad-Toh; Ewan Birney; Jane Rogers; Josep F Abril; Pankaj Agarwal; Richa Agarwala; Rachel Ainscough; Marina Alexandersson; Peter An; Stylianos E Antonarakis; John Attwood; Robert Baertsch; Jonathon Bailey; Karen Barlow; Stephan Beck; Eric Berry; Bruce Birren; Toby Bloom; Peer Bork; Marc Botcherby; Nicolas Bray; Michael R Brent; Daniel G Brown; Stephen D Brown; Carol Bult; John Burton; Jonathan Butler; Robert D Campbell; Piero Carninci; Simon Cawley; Francesca Chiaromonte; Asif T Chinwalla; Deanna M Church; Michele Clamp; Christopher Clee; Francis S Collins; Lisa L Cook; Richard R Copley; Alan Coulson; Olivier Couronne; James Cuff; Val Curwen; Tim Cutts; Mark Daly; Robert David; Joy Davies; Kimberly D Delehaunty; Justin Deri; Emmanouil T Dermitzakis; Colin Dewey; Nicholas J Dickens; Mark Diekhans; Sheila Dodge; Inna Dubchak; Diane M Dunn; Sean R Eddy; Laura Elnitski; Richard D Emes; Pallavi Eswara; Eduardo Eyras; Adam Felsenfeld; Ginger A Fewell; Paul Flicek; Karen Foley; Wayne N Frankel; Lucinda A Fulton; Robert S Fulton; Terrence S Furey; Diane Gage; Richard A Gibbs; Gustavo Glusman; Sante Gnerre; Nick Goldman; Leo Goodstadt; Darren Grafham; Tina A Graves; Eric D Green; Simon Gregory; Roderic Guigó; Mark Guyer; Ross C Hardison; David Haussler; Yoshihide Hayashizaki; LaDeana W Hillier; Angela Hinrichs; Wratko Hlavina; Timothy Holzer; Fan Hsu; Axin Hua; Tim Hubbard; Adrienne Hunt; Ian Jackson; David B Jaffe; L Steven Johnson; Matthew Jones; Thomas A Jones; Ann Joy; Michael Kamal; Elinor K Karlsson; Donna Karolchik; Arkadiusz Kasprzyk; Jun Kawai; Evan Keibler; Cristyn Kells; W James Kent; Andrew Kirby; Diana L Kolbe; Ian Korf; Raju S Kucherlapati; Edward J Kulbokas; David Kulp; Tom Landers; J P Leger; Steven Leonard; Ivica Letunic; Rosie Levine; Jia Li; Ming Li; Christine Lloyd; Susan Lucas; Bin Ma; Donna R Maglott; Elaine R Mardis; Lucy Matthews; Evan Mauceli; John H Mayer; Megan McCarthy; W Richard McCombie; Stuart McLaren; 
Kirsten McLay; John D McPherson; Jim Meldrim; Beverley Meredith; Jill P Mesirov; Webb Miller; Tracie L Miner; Emmanuel Mongin; Kate T Montgomery; Michael Morgan; Richard Mott; James C Mullikin; Donna M Muzny; William E Nash; Joanne O Nelson; Michael N Nhan; Robert Nicol; Zemin Ning; Chad Nusbaum; Michael J O'Connor; Yasushi Okazaki; Karen Oliver; Emma Overton-Larty; Lior Pachter; Genís Parra; Kymberlie H Pepin; Jane Peterson; Pavel Pevzner; Robert Plumb; Craig S Pohl; Alex Poliakov; Tracy C Ponce; Chris P Ponting; Simon Potter; Michael Quail; Alexandre Reymond; Bruce A Roe; Krishna M Roskin; Edward M Rubin; Alistair G Rust; Ralph Santos; Victor Sapojnikov; Brian Schultz; Jörg Schultz; Matthias S Schwartz; Scott Schwartz; Carol Scott; Steven Seaman; Steve Searle; Ted Sharpe; Andrew Sheridan; Ratna Shownkeen; Sarah Sims; Jonathan B Singer; Guy Slater; Arian Smit; Douglas R Smith; Brian Spencer; Arne Stabenau; Nicole Stange-Thomann; Charles Sugnet; Mikita Suyama; Glenn Tesler; Johanna Thompson; David Torrents; Evanne Trevaskis; John Tromp; Catherine Ucla; Abel Ureta-Vidal; Jade P Vinson; Andrew C Von Niederhausern; Claire M Wade; Melanie Wall; Ryan J Weber; Robert B Weiss; Michael C Wendl; Anthony P West; Kris Wetterstrand; Raymond Wheeler; Simon Whelan; Jamey Wierzbowski; David Willey; Sophie Williams; Richard K Wilson; Eitan Winter; Kim C Worley; Dudley Wyman; Shan Yang; Shiaw-Pyng Yang; Evgeny M Zdobnov; Michael C Zody; Eric S Lander
Journal: Nature Date: 2002-12-05 Impact factor: 49.962

5. The UCSC Genome Browser Database.

Authors: D Karolchik; R Baertsch; M Diekhans; T S Furey; A Hinrichs; Y T Lu; K M Roskin; M Schwartz; C W Sugnet; D J Thomas; R J Weber; D Haussler; W J Kent
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

6. Spurious spatial periodicity of co-expression in microarray data due to printing design.

Authors: Gábor Balázsi; Krin A Kay; Albert-László Barabási; Zoltán N Oltvai
Journal: Nucleic Acids Res Date: 2003-08-01 Impact factor: 16.971

7. Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction.

Authors: Esti Yeger-Lotem; Shmuel Sattath; Nadav Kashtan; Shalev Itzkovitz; Ron Milo; Ron Y Pinter; Uri Alon; Hanah Margalit
Journal: Proc Natl Acad Sci U S A Date: 2004-04-12 Impact factor: 11.205

8. MIPS: analysis and annotation of proteins from whole genomes.

Authors: H W Mewes; C Amid; R Arnold; D Frishman; U Güldener; G Mannhaupt; M Münsterkötter; P Pagel; N Strack; V Stümpflen; J Warfsmann; A Ruepp
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

9. Aggregation of topological motifs in the Escherichia coli transcriptional regulatory network.

Authors: Radu Dobrin; Qasim K Beg; Albert-László Barabási; Zoltán N Oltvai
Journal: BMC Bioinformatics Date: 2004-01-30 Impact factor: 3.169

10. The European dimension for the mouse genome mutagenesis program.

Authors: Johan Auwerx; Phil Avner; Richard Baldock; Andrea Ballabio; Rudi Balling; Mariano Barbacid; Anton Berns; Allan Bradley; Steve Brown; Peter Carmeliet; Pierre Chambon; Roger Cox; Duncan Davidson; Kay Davies; Denis Duboule; Jiri Forejt; Francesca Granucci; Nick Hastie; Martin Hrabé de Angelis; Ian Jackson; Dimitris Kioussis; George Kollias; Mark Lathrop; Urban Lendahl; Marcos Malumbres; Harald von Melchner; Werner Müller; Juha Partanen; Paola Ricciardi-Castagnoli; Peter Rigby; Barry Rosen; Nadia Rosenthal; Bill Skarnes; A Francis Stewart; Janet Thornton; Glauco Tocchini-Valentini; Erwin Wagner; Walter Wahli; Wolfgang Wurst
Journal: Nat Genet Date: 2004-09 Impact factor: 38.330
