MIPS: curated databases and comprehensive secondary data resources in 2010.

H Werner Mewes, Andreas Ruepp, Fabian Theis, Thomas Rattei, Mathias Walter, Dmitrij Frishman, Karsten Suhre, Manuel Spannagl, Klaus F X Mayer, Volker Stümpflen, Alexey Antonov.

Abstract

The Munich Information Center for Protein Sequences (MIPS at the Helmholtz Center for Environmental Health, Neuherberg, Germany) has many years of experience in providing annotated collections of biological data. Selected data sets of high relevance, such as model genomes, are subjected to careful manual curation, while the bulk of high-throughput data is annotated by automatic means. High-quality reference resources developed in the past and still actively maintained include the Saccharomyces cerevisiae, Neurospora crassa and Arabidopsis thaliana genome databases as well as several protein interaction data sets (MPACT, MPPI and CORUM). More recent projects are PhenomiR, the database on microRNA-related phenotypes, and MIPS PlantsDB for integrative and comparative plant genome research. The interlinked resources SIMAP and PEDANT provide homology relationships as well as up-to-date and consistent annotation for 38,000,000 protein sequences. PLIPS and CCancer are versatile tools for proteomics and functional genomics that interface with a database of gene lists compiled from the literature. A novel literature-mining tool, EXCERBT, gives access to structured information on classified relations between genes, proteins, phenotypes and diseases extracted from Medline abstracts by semantic analysis. All databases described here, as well as detailed descriptions of our projects, can be accessed through the MIPS WWW server (http://mips.helmholtz-muenchen.de).

Substances: MicroRNAs

Year: 2010 PMID: 21109531 PMCID: PMC3013725 DOI: 10.1093/nar/gkq1157

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The MIPS group has provided biomolecular databases and related resources for more than 20 years (Table 1). In striving for accuracy, completeness and timeliness, manually curated databases face ever-growing amounts of data and related information. The gap between the effort needed to maintain expert-curated databases and the resources available is well known but unfortunately persistent. Manual curation therefore has to concentrate on rather specialized subjects. As an alternative, the application of computational tools and the generation of databases of secondary information can provide most of the information rather efficiently. Although a large number of databases are available, combining and integrating their information remains difficult, even though the implementation of web services has improved the situation. We therefore concentrate on three lines of service: (i) the development and maintenance of primary, manually curated databases for a number of specialized areas of interest that are widely used as gold standards, integrating factual information and the biological knowledge extracted from the literature; (ii) large-scale and comprehensive databases of secondary data such as (1) SIMAP, the exhaustive database of protein similarities, currently containing 38 million non-redundant protein sequences and (2) PEDANT, the comprehensive database of annotated genomes, now containing about 3800 genomes; and (iii) EXCERBT, a query system for the retrieval of biological knowledge based on semantic web technology.

Table 1.

URL addresses for MIPS database resources and associated user interfaces

Project description	Link
Project overview	www.helmholtz-muenchen.de/en/mips/projects
Arabidopsis thaliana genome (MATDB)	http://mips.helmholtz-muenchen.de/plant/athal/
Complete Genomes (PEDANT server)	http://pedant.gsf.de/
Comprehensive Yeast Genome Database (CYGD)	http://mips.helmholtz-muenchen.de/genre/proj/yeast/
EXCERBT: Semantic text mining	http://mips.helmholtz-muenchen.de/geknowme/web/excerbt
FunCat: Functional Catalogue of Proteins	www.helmholtz-muenchen.de/en/mips/projects/funcat
GenRE: Genome Research Environment	www.helmholtz-muenchen.de/en/mips/projects/genre
MIPS Neurospora crassa Database (MNCDB)	http://mips.helmholtz-muenchen.de/genre/proj/ncrassa/
MOsDB: Rice Genome Database	http://mips.helmholtz-muenchen.de/plant/rice/
MPPI: Mammalian Protein–Protein Interactions	http://mips.helmholtz-muenchen.de/proj/ppi/
SIMAP: Similarity Matrix of Proteins	www.helmholtz-muenchen.de/en/mips/projects/simap
The Lotus Genome Database (Lotus japonicus)	http://mips.helmholtz-muenchen.de/plant/lotus/
CORUM	http://mips.helmholtz-muenchen.de/genre/proj/corum
Medicago MT3 genome database	http://mips.helmholtz-muenchen.de/plant/medi3
FGDB: Fusarium graminearum genome database	http://mips.helmholtz-muenchen.de/genre/proj/FGDB/
MassTRIX	http://masstrix.org or http://metabolomics.helmholtz-muenchen.de/masstrix2/
metaP-Server	http://metabolomics.helmholtz-muenchen.de/metap2/
MUMDB: Ustilago maydis genome database	http://mips.helmholtz-muenchen.de/genre/proj/ustilago/
MPACT: representation of interaction data at MIPS	http://mips.helmholtz-muenchen.de/genre/proj/mpact/
PlantsDB	http://mips.helmholtz-muenchen.de/projects/plants/
PhenomiR: miRNA–phenotype relations	http://mips.helmholtz-muenchen.de/phenomir/
TICL: network-based analysis of compound lists	http://mips.helmholtz-muenchen.de/proj/cmp/

RECENT DEVELOPMENTS

PhenomiR: structured information on microRNA–phenotype relations

Small endogenous non-coding RNA species known as microRNAs (miRNAs) are essential for a wide variety of cellular processes in higher eukaryotes such as cell and organ development. They are directly or indirectly related to a number of diseases, including cancer. In order to collect the ever-growing body of data in a structured and uniform format, we created the PhenomiR database (1). All information in PhenomiR is extracted from published studies relating miRNAs to diseases or processes, and has been annotated manually to ensure the high quality of the resource. PhenomiR provides a comprehensive annotation of published experimental data and their origin, including the mode of miRNA expression (up- or downregulation), the miRNA detection method, cohort information of patient studies and the study design. Recording the origin of the samples (patients or type of cell culture) makes it possible to analyse whether inhomogeneous results stem from physiological differences within the sample set. PhenomiR also provides quantitative levels of miRNA expression in a readily accessible way if the authors published these data. This information allows, for example, discrimination between marginally and significantly deregulated miRNAs. The second release of PhenomiR (as of September 2010) contains data from 362 articles that describe 628 experiments. This data set includes 12 189 data points, each representing one deregulated miRNA in an experiment. A survey of the PhenomiR data set reveals that cancers are by far the most thoroughly investigated diseases (76.6%), followed by neurological (6.3%) and cardiovascular (4.7%) disorders. MiRNA-mediated gene silencing was shown to be involved in a number of cellular processes such as cell growth, cholesterol homeostasis and response to hypoxia. To our knowledge, PhenomiR is the only database that collects altered miRNA expression not only in diseases but also more generally in biological processes.
The availability of both kinds of data allows miRNA expression in diseases and in cellular processes to be compared in related cell types. Granulocyte development (16 miRNAs) and multiple myeloma (69 miRNAs), for example, share 14 deregulated miRNAs. All of these miRNAs are upregulated. As both processes are based on cell growth, similar behaviour in regulatory processes can be expected. The fraction of miRNAs that is specific for multiple myeloma contains a majority (31 of 55) of downregulated miRNAs (Supplementary Figure S1).
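The comparison described above amounts to set operations over annotated data points. The following is a minimal sketch in Python, not the PhenomiR schema or interface: the `DataPoint` record, its field names and all miRNA identifiers are hypothetical placeholders, and the synthetic data are constructed only to reproduce the counts quoted in the text (16, 69, 14 shared, 31 of 55 downregulated).

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class DataPoint:
    """One deregulated miRNA in one context (hypothetical record layout)."""
    mirna: str      # placeholder identifier, not a real miRNA name
    context: str    # disease or biological process
    direction: str  # "up", "down" or "unknown"


def deregulated(points: Iterable[DataPoint], context: str) -> Set[str]:
    """Return the set of miRNAs reported as deregulated in a context."""
    return {p.mirna for p in points if p.context == context}


# Synthetic identifiers, sized to match the counts given in the text.
shared = [f"mir-shared-{i}" for i in range(14)]
process_only = [f"mir-process-{i}" for i in range(2)]
disease_only = [f"mir-disease-{i}" for i in range(55)]

points = (
    # The 14 shared miRNAs are upregulated in both contexts (from the text).
    [DataPoint(m, "granulocyte development", "up") for m in shared]
    + [DataPoint(m, "multiple myeloma", "up") for m in shared]
    # Direction of the two process-specific miRNAs is not given in the text.
    + [DataPoint(m, "granulocyte development", "unknown") for m in process_only]
    # 31 of the 55 myeloma-specific miRNAs are downregulated; the remaining
    # 24 are assumed to be upregulated for the purpose of this sketch.
    + [DataPoint(m, "multiple myeloma", "down" if i < 31 else "up")
       for i, m in enumerate(disease_only)]
)

granulocyte = deregulated(points, "granulocyte development")
myeloma = deregulated(points, "multiple myeloma")
common = granulocyte & myeloma
myeloma_specific = myeloma - granulocyte
down_specific = {
    p.mirna for p in points
    if p.context == "multiple myeloma"
    and p.mirna in myeloma_specific
    and p.direction == "down"
}

print(len(granulocyte), len(myeloma), len(common),
      len(myeloma_specific), len(down_specific))
```

Keeping the direction of regulation on each data point, rather than on the miRNA itself, allows the same miRNA to be up in one context and down in another.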

MIPS PlantsDB: plant database resource for integrative and comparative plant genome research

The massive generation of plant genomics data calls for in-depth analysis and enables powerful comparative analysis. PlantsDB aims to provide data and information resources for individual plant species and to build a platform for integrative and comparative plant genome research. PlantsDB comprises genome databases for Arabidopsis, Medicago, Lotus, rice, maize, tomato, barley, Brachypodium and Sorghum. Complementary data resources for repetitive elements and extensive cross-species comparison are also implemented. The ongoing projects include model genomes (Arabidopsis thaliana, Arabidopsis lyrata, Brachypodium distachyon and Medicago truncatula) as well as important crop genomes such as barley, Sorghum, maize and tomato. Besides the need for comprehensive, structured genome and knowledge information for the individual species, detailed and comprehensive comparative analysis requires a range of plant genomes that represent a wide evolutionary spectrum. In addition, these information resources will help to elucidate genomic elements that have not been discovered so far or have been difficult to detect. Consistent, detailed data and structured information resources are a prerequisite for in-depth cross-species comparisons and comparative phylogenetic analysis. PlantsDB provides a highly flexible, modular database infrastructure for a wide range of plant genomic data. The respective species databases are updated and new data are continuously integrated, either through adjustment against external resources or via the participation of MIPS in a range of plant genome sequencing projects. While individual organism-related databases are an important pillar of PlantsDB, its focus extends beyond individual genomes. PlantsDB aims to make available resources that span species and that address and support specific questions in comparative and integrative plant genomics.
These include integrated resources for the detection and analysis of repeat elements, comparative views and navigation systems. The PlantsDB resources are complemented by BioMOBY-based web services that support seamless navigation and the combination of services provided by PlantsDB and partner databases worldwide.

Similarity Matrix of Proteins: exhaustive protein similarity matrix

The Similarity Matrix of Proteins (SIMAP) is an exhaustive, up-to-date database of pre-calculated similarities and features of protein sequences (2). In order to cover the publicly accessible sequence space comprehensively, SIMAP includes the sequence data from all major protein sequence databases as well as from important repositories of environmental sequences. By 2010, SIMAP had grown continuously to 38 million non-redundant protein sequences collected from more than 80 million database entries. The resulting similarity network consists of more than 400 billion edges, connecting the proteins at the most sensitive level that can be achieved by pair-wise local protein sequence alignments. These unique data can be accessed via a user-friendly web portal that provides customizable search tools and integrates information from sequence similarity, protein domains and sequence clusters. Alternatively, applications and other resources can retrieve SIMAP data freely through web services. SIMAP thus not only speeds up traditional sequence and genome analyses based on sequence similarities, but also facilitates network-based approaches exploring the protein sequence space.

PEDANT genome database

The PEDANT genome database is a comprehensive database of automatically annotated genomes (3). The current version of PEDANT contains 3802 publicly available genomes (89 archaea, 1076 bacteria, 132 eukaryotes and 2505 viruses). It provides full coverage of the NCBI’s RefSeq database (4). New RefSeq genomes are imported monthly, and genomes already imported are updated regularly. Changes to genetic elements are tracked. Obsolete elements are marked as outdated but are not deleted, in order to keep inbound Web links functional. The underlying database system has been adapted for improved scalability and now uses optimized data types in combination with index and data compression. This enables cross-genome queries and increases retrieval performance. A new pathway coverage method has been added that determines which metabolic pathways are present in a given genome on the basis of computationally derived EC numbers. Integration with other MIPS databases and external resources has been improved. PEDANT proteins are now directly linked to SIMAP and to GBrowse databases hosted at MIPS (e.g. Ustilago maydis).
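The idea behind a pathway coverage calculation can be illustrated with a short Python sketch. This is not the PEDANT implementation: the function name, the pathway definitions and the 0.5 presence threshold are assumptions made for illustration, and the pathway names are toy placeholders. Coverage is taken here as the fraction of a pathway's EC numbers that were predicted for the genome.

```python
from typing import Dict, List, Set


def pathway_coverage(genome_ecs: Set[str],
                     pathways: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of each pathway's EC numbers found among the genome's ECs."""
    coverage = {}
    for name, required in pathways.items():
        if not required:
            continue  # skip pathways without any EC annotation
        coverage[name] = len(required & genome_ecs) / len(required)
    return coverage


def present_pathways(coverage: Dict[str, float],
                     threshold: float = 0.5) -> List[str]:
    """Pathways whose coverage reaches the (assumed) threshold."""
    return sorted(name for name, frac in coverage.items() if frac >= threshold)


# Toy input: pathway names and memberships are placeholders, not curated data.
pathways = {
    "toy_pathway_A": {"1.1.1.1", "2.7.1.1", "4.1.2.13"},
    "toy_pathway_B": {"3.5.1.5", "6.3.1.2"},
    "toy_pathway_C": {"5.3.1.9"},
}
# EC numbers assumed to have been derived computationally for one genome.
genome_ecs = {"1.1.1.1", "2.7.1.1", "6.3.1.2"}

coverage = pathway_coverage(genome_ecs, pathways)
present = present_pathways(coverage)
print(coverage, present)
```

A fractional score rather than a yes/no call leaves room for incomplete EC predictions; the cut-off used to call a pathway present is a tunable choice.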

Metabolomics@MIPS

The enormous amount of data produced by modern kit-based high-throughput metabolomics experiments poses new challenges regarding their biological interpretation in the context of various sample phenotypes. We currently provide three web-based data-analysis tools for metabolomics: metaP-server for high-throughput targeted metabolomics, MassTRIX (5) for deep-drilling non-targeted metabolomics and TICL (6) for network-based interpretation of a compound list. All servers provide results that link out to dedicated metabolomics databases such as KEGG (7), HMDB (8), LIPID MAPS (9) and a hand-curated in-house database with a specific focus on kit-based lipidomics read-outs.

EXCERBT: a database of biological object relations built from biomedical texts

Current databases represent factual information, either as primary data collections or as derived information. Knowledge derived from factual observations is largely buried in the scientific literature. As it becomes increasingly difficult to follow all publications in a given research area, the automatic analysis of free text and its transformation into structured databases become a necessity. Specialized text mining (TM) systems such as GIN (10) or RLIMS-P (11) have been published. Few systems have a broader scope that integrates multiple text resources, relation types and entity types. EXCERBT (‘extraction of classified entities and relations from biomedical texts’) is a database of semantic relations extracted from MEDLINE abstracts for the retrieval of a multitude of relation types such as ‘activation’, ‘inhibition’ or ‘phosphorylation’ and a broad range of biomedical entity types. The subject-centric results are usually presented within a few seconds, although the system covers MEDLINE and several smaller sources of literature. Owing to the multitude of integrated relation and entity types and the possibility of submitting queries in the passive as well as the active voice, EXCERBT is a versatile resource for exploring the giant combinatorial space between semantically associated biological entities. With ∼4 billion relations, EXCERBT is suited to building large-, medium- and small-scale qualitative networks for the systematic exploration of topic-related information (Figure 1).

Figure 1.

EXCERBT interface: (i) “which phenotypes are caused by the wrn gene?”; (ii) list of gene–phenotype pairs and detailed literature evidence; and (iii) graphical representation of selected relations.
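A query such as the one in Figure 1 can be pictured as a lookup over typed relations. The sketch below, in Python, is an illustration under stated assumptions and not the EXCERBT data model or API: the `Relation` record, the helper functions and all entity names and source labels are hypothetical. It shows how a passive-voice question ("which phenotypes are caused by gene A?") and its active-voice counterpart ("gene A causes which phenotypes?") can be mapped onto the same agent and target slots.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Relation:
    """One extracted relation (hypothetical layout)."""
    agent: str      # entity performing the action, e.g. a gene
    predicate: str  # normalized relation type, e.g. "cause", "inhibit"
    target: str     # entity acted upon, e.g. a phenotype
    source: str     # placeholder label for the supporting abstract


def find(relations: Iterable[Relation], predicate: str,
         agent: Optional[str] = None,
         target: Optional[str] = None) -> List[Relation]:
    """Return relations matching the predicate and any fixed slots."""
    return [
        r for r in relations
        if r.predicate == predicate
        and (agent is None or r.agent == agent)
        and (target is None or r.target == target)
    ]


def from_active(subject: Optional[str], predicate: str,
                obj: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Active voice: the grammatical subject is the agent."""
    return {"predicate": predicate, "agent": subject, "target": obj}


def from_passive(subject: Optional[str], predicate: str,
                 by: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Passive voice: the 'by' phrase names the agent."""
    return {"predicate": predicate, "agent": by, "target": subject}


# Placeholder relations; entity names and sources are invented for the sketch.
store = [
    Relation("geneA", "cause", "phenotypeX", "abstract-1"),
    Relation("geneA", "inhibit", "geneB", "abstract-2"),
    Relation("geneC", "cause", "phenotypeX", "abstract-3"),
]

# "Which phenotypes are caused by geneA?" (passive, open subject)
passive_hits = find(store, **from_passive(None, "cause", by="geneA"))
# "geneA causes which phenotypes?" (active, open object)
active_hits = find(store, **from_active("geneA", "cause"))

print([r.target for r in passive_hits])
```

Normalizing both voices to one agent/target representation before the lookup means each relation needs to be stored only once.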

PLIPS and CCancer: a database of gene/protein lists reported in experimental studies in various functional contexts

‘Omics’ technologies provide a spectrum of methods applied in many fields of cellular and molecular biology, such as the identification of diagnostic biomarkers, monitoring the effects of drug treatments or profiling transcripts related to the onset and progress of diseases. Although the biological systems tested differ widely, the primary results of the majority of ‘omics’ studies are gene/protein lists. Several thousand independent experimental studies have reported such lists. Although publicly available, this valuable information was scattered across hundreds of papers and was not accessible for automatic analysis. To make it available, we collected this type of information by searching through full-text papers and automatically selecting tables that report a list of gene or protein identifiers. PLIPS (12) is a database of protein lists reported previously by experimental proteomics studies. At the moment, PLIPS covers about 1500 protein lists reported by ∼1200 proteomics publications. CCancer (13) is a database of gene lists, most of which were reported by experimental studies in various biological and clinical settings. At the moment, the database covers 3500 gene lists extracted from 2800 papers published in about 100 peer-reviewed journals. Both databases provide user-friendly computational services. Users can submit their own list of gene/protein identifiers to find a catalogue of previously published studies that report a table of genes/proteins that significantly intersects with the query list. To understand the functional context of an experimentally identified gene/protein list, MIPS provides a comprehensive collection of online tools for functional profiling. The underlying database covers nearly all available information regarding gene function [Gene Ontology (9;14)], protein interactions [IntAct (15)] and pathway relationships [Reactome (16), KEGG (7)].
The tools implement robust statistical frameworks and provide computational interfaces to most available public databases. With regard to the statistical methodology employed, the available web tools can be divided into two categories. The first group [ProfCom (17), PLIPS (12), CCancer (13), GeneSet2MiRNA (18)] employs a modified enrichment analysis schema (19). The second group [KEGG spider (20), R spider (21), PPI spider (22)] implements a novel statistical methodology for the network-based interpretation of a gene list. Finally, GeneSet2MiRNA provides statistical information on whether or not a query gene list has a signature of miRNA regulatory activity.
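Whether a query list "significantly intersects" a published list can be judged with a standard one-sided hypergeometric test. The Python sketch below shows that textbook test only; it is not the modified enrichment schema (19) used by the MIPS tools, and the function name, the universe size and the gene identifiers are assumptions made for illustration.

```python
from math import comb
from typing import Set


def overlap_p_value(universe: int, size_a: int, size_b: int,
                    overlap: int) -> float:
    """P(X >= overlap) when size_b items are drawn without replacement
    from a universe containing size_a 'marked' items."""
    upper = min(size_a, size_b)
    total = comb(universe, size_b)
    tail = sum(
        comb(size_a, i) * comb(universe - size_a, size_b - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def list_overlap_p_value(query: Set[str], published: Set[str],
                         universe: int) -> float:
    """Hypergeometric p-value for the intersection of two identifier lists."""
    return overlap_p_value(universe, len(published), len(query),
                           len(query & published))


# Placeholder identifiers; the universe of 20000 genes is an assumed value.
query = {f"gene{i}" for i in range(50)}
published = {f"gene{i}" for i in range(45, 145)}  # shares gene45..gene49

p_value = list_overlap_p_value(query, published, universe=20000)
print(f"overlap={len(query & published)}, p={p_value:.3g}")
```

In practice a query list is compared against thousands of stored lists, so the resulting p-values would also need a multiple-testing correction before any overlap is reported as significant.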

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

The Federal Ministry of Education, Science, Research and Technology (BMBF: GABI Barlex: 0314000C; SysMBo: 0315494A); the European Commission (Framework 6 & 7; Grain Legumes Integrative Project and the Triticeae Genome Project); and the Helmholtz Alliance Systems Biology (CoReNe). Funding for open access charge: Helmholtz Center for Health and Environment. Conflict of interest statement. None declared.

REFERENCES (10 of 22 listed)

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. An online literature mining tool for protein phosphorylation.

Authors: X Yuan; Z Z Hu; H T Wu; M Torii; M Narayanaswamy; K E Ravikumar; K Vijay-Shanker; C H Wu
Journal: Bioinformatics Date: 2006-04-27 Impact factor: 6.937

3. Complex functionality of gene groups identified from high-throughput data.

Authors: Alexey V Antonov; Hans W Mewes
Journal: J Mol Biol Date: 2006-07-29 Impact factor: 5.469

4. PLIPS, an automatically collected database of protein lists reported by proteomics studies.

Authors: Alexey V Antonov; Sabine Dietmann; Philip Wong; Rodchenkov Igor; Hans W Mewes
Journal: J Proteome Res Date: 2009-03 Impact factor: 4.466

5. PPI spider: a tool for the interpretation of proteomics data in the context of protein-protein interaction networks.

Authors: Alexey V Antonov; Sabine Dietmann; Igor Rodchenkov; Hans W Mewes
Journal: Proteomics Date: 2009-05 Impact factor: 3.984

6. TICL--a web tool for network-based interpretation of compound lists inferred by high-throughput metabolomics.

Authors: Alexey V Antonov; Sabine Dietmann; Philip Wong; Hans W Mewes
Journal: FEBS J Date: 2009-04 Impact factor: 5.542

7. HMDB: a knowledgebase for the human metabolome.

Authors: David S Wishart; Craig Knox; An Chi Guo; Roman Eisner; Nelson Young; Bijaya Gautam; David D Hau; Nick Psychogios; Edison Dong; Souhaila Bouatra; Rupasri Mandal; Igor Sinelnikov; Jianguo Xia; Leslie Jia; Joseph A Cruz; Emilia Lim; Constance A Sobsey; Savita Shrivastava; Paul Huang; Philip Liu; Lydia Fang; Jun Peng; Ryan Fradette; Dean Cheng; Dan Tzur; Melisa Clements; Avalyn Lewis; Andrea De Souza; Azaret Zuniga; Margot Dawe; Yeping Xiong; Derrick Clive; Russ Greiner; Alsu Nazyrova; Rustem Shaykhutdinov; Liang Li; Hans J Vogel; Ian Forsythe
Journal: Nucleic Acids Res Date: 2008-10-25 Impact factor: 16.971

8. Reactome knowledgebase of human biological pathways and processes.

Authors: Lisa Matthews; Gopal Gopinath; Marc Gillespie; Michael Caudy; David Croft; Bernard de Bono; Phani Garapati; Jill Hemish; Henning Hermjakob; Bijay Jassal; Alex Kanapin; Suzanna Lewis; Shahana Mahajan; Bruce May; Esther Schmidt; Imre Vastrik; Guanming Wu; Ewan Birney; Lincoln Stein; Peter D'Eustachio
Journal: Nucleic Acids Res Date: 2008-11-03 Impact factor: 16.971

9. KEGG spider: interpretation of genomics data in the context of the global gene metabolic network.

Authors: Alexey V Antonov; Sabine Dietmann; Hans W Mewes
Journal: Genome Biol Date: 2008-12-18 Impact factor: 13.583

10. LIPID MAPS online tools for lipid research.

Authors: Eoin Fahy; Manish Sud; Dawn Cotter; Shankar Subramaniam
Journal: Nucleic Acids Res Date: 2007-06-21 Impact factor: 16.971
