
CORUM: the comprehensive resource of mammalian protein complexes--2009.

Andreas Ruepp, Brigitte Waegele, Martin Lechner, Barbara Brauner, Irmtraud Dunger-Kaltenbach, Gisela Fobo, Goar Frishman, Corinna Montrone, H-Werner Mewes.

Abstract

CORUM is a database that provides a manually curated repository of experimentally characterized protein complexes from mammalian organisms, mainly human (64%), mouse (16%) and rat (12%). Protein complexes are key molecular entities that integrate multiple gene products to perform cellular functions. The new CORUM 2.0 release encompasses 2837 protein complexes offering the largest and most comprehensive publicly available dataset of mammalian protein complexes. The CORUM dataset is built from 3198 different genes, representing approximately 16% of the protein coding genes in humans. Each protein complex is described by a protein complex name, subunit composition, function as well as the literature reference that characterizes the respective protein complex. Recent developments include mapping of functional annotation to Gene Ontology terms as well as cross-references to Entrez Gene identifiers. In addition, a 'Phylogenetic Conservation' analysis tool was implemented that analyses the potential occurrence of orthologous protein complex subunits in mammals and other selected groups of organisms. This allows one to predict the occurrence of protein complexes in different phylogenetic groups. CORUM is freely accessible at (http://mips.helmholtz-muenchen.de/genre/proj/corum/index.html).


Year: 2009 PMID: 19884131 PMCID: PMC2808912 DOI: 10.1093/nar/gkp914

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Major cellular processes like cell cycle, protein folding and protein degradation depend on the activity of protein complexes (1). To date there are no reliable estimates of the total number of protein complexes in cells (the complexome), but data from single-cell organisms provide evidence that more than half of the gene products are involved in the formation of protein complexes (2). With the advent of protein network analyses, the topological properties of protein complexes gave rise to terms such as ‘party hubs’ (3) or ‘multi-interface hubs’ (4). Bioinformatics analysis of protein–protein interaction (PPI) datasets revealed that protein complex subunits are more strongly conserved in evolution and show a higher essentiality than proteins from other interactions (4). As the most comprehensive PPI and protein complex data are available for Saccharomyces cerevisiae, most of these discoveries were obtained using data from yeast. In addition to a manually curated dataset of protein complexes (5), tag-based high-throughput approaches were performed in order to define the yeast complexome (6,7). The importance of manually curated gold standards was demonstrated by analyses of results from high-throughput experiments. An assessment of different high-throughput technologies for the analysis of PPIs showed that each method, depending on its physicochemical constraints, captures interactions for different subsets of proteins (8). Thus, none of the existing methods is able to detect all interactions, and it was also shown that even the combined dataset of five different methods missed ∼40% of experimentally validated, manually curated interactions (9). For mammals no comprehensive high-throughput dataset of protein complexes is publicly available. Bioinformatics analyses of the mammalian complexome can be performed either by using artificially constructed protein complexes (10) or data from manually curated datasets (11,12).
In 2008, the CORUM database was introduced as the most comprehensive catalogue of mammalian protein complexes. All data are manually curated, including information on protein complex subunits and methods of purification as well as additional information such as functional annotation using the Functional Catalogue (FunCat) annotation scheme (13), stoichiometry of the subunits and information about association with diseases (14). Analyses of the CORUM dataset have shown (i) that mammalian protein complexes are most frequently composed of 3 or 4 different subunits and (ii) that proteins tend to be reused in up to 53 protein complexes (15). The CORUM dataset has been used for a number of bioinformatics analyses like tissue-specific expression of proteins (16), functional interpretation of high-throughput data (17–19) or the prediction of interactions of protein regions (20). In addition, the dataset contributes to web-based applications like the DICS database of functional modules (21) or the COFECO tool for composite function annotation (22). The CORUM Release 2.0 presents a significantly extended dataset that now consists of 2837 mammalian protein complexes. In addition to existing cross-references, the dataset was mapped to Entrez Gene identifiers and the functional annotation was mapped to Gene Ontology (GO) terms. In order to enable more specific search results in comments, the content is now distributed into the three sections ‘Disease Comment’, ‘Functional Comment’ and ‘Subunit Comment’. Finally, an analysis tool was implemented that allows one to predict the occurrence of orthologous protein complex subunits in other mammals and other groups of organisms. The ‘Phylogenetic Conservation’ tool estimates how likely a protein complex is to occur in the analysed model organisms. CORUM is freely accessible at http://mips.helmholtz-muenchen.de/genre/proj/corum/index.html.

NEW DEVELOPMENTS

Dataset and cross-references

In 2008 the CORUM dataset consisted of 1750 mammalian protein complexes, mainly characterized in human (60%), mouse (14%) and rat (14%) (14). While the relative abundance of the related organisms has remained stable since then, the number of protein complexes had grown to 2837 by September 2009. Thus, CORUM is the largest set of mammalian protein complexes publicly available. However, compared to data from single-cell organisms only a minor fraction of the mammalian complexome has been discovered so far. Data from yeast have shown that at least 45% of the gene complement function as subunits in protein complexes (14). Considering that there is no comprehensive mammalian high-throughput dataset available to date, the fraction of genes known to be involved in protein complex formation is comparatively low. These estimates are based on the number of different complex subunit genes divided by a given number of 20 488 genes in human (14). Compared to the first CORUM release, this fraction increased moderately from 12% (2400 genes) to 16% (3198 genes). The slow increase of novel protein complex subunits presumably results from the reuse of subunits (Figure 1) in different protein complexes or protein complex variants (15). Data from the CORUM ‘Core Set’ (see below) show that proteins like ‘integrin beta-1’, ‘histone deacetylase 1’ and ‘histone deacetylase 2’ appear in 54, 51 and 38 different human protein complexes, respectively. Multiple reutilization of protein complex subunits is particularly found in large protein complex families like SNARE complexes and ubiquitin E3 ligases. The ubiquitin E3 ligase subunit ring-box 1 (Rbx1), for example, was identified in 35 complexes.
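The two calculations described above, the fraction of human genes covered by complex subunits and the reuse of subunits across complexes, can be illustrated with a minimal Python sketch. The gene counts are taken from the text; the function names and the toy complexes are hypothetical and are not part of CORUM.

```python
from collections import Counter

# Number of human genes used as the denominator in the text.
HUMAN_GENE_COUNT = 20488


def coverage_percent(n_subunit_genes, n_genes=HUMAN_GENE_COUNT):
    """Percentage of genes that encode subunits of annotated complexes."""
    return 100.0 * n_subunit_genes / n_genes


def reuse_counts(complexes):
    """Map each subunit to the number of complexes in which it occurs."""
    counts = Counter()
    for subunits in complexes.values():
        # set() ensures a subunit is counted once per complex.
        counts.update(set(subunits))
    return counts


# Hypothetical toy data: complex name -> subunit identifiers.
toy_complexes = {
    "complex_A": ["P1", "P2", "P3"],
    "complex_B": ["P1", "P4"],
    "complex_C": ["P1", "P2"],
}

counts = reuse_counts(toy_complexes)
# Histogram in the style of Figure 1:
# number of complexes -> number of proteins found in that many complexes.
histogram = Counter(counts.values())

first_release = round(coverage_percent(2400))
release_2 = round(coverage_percent(3198))
print(first_release, release_2)
print(dict(histogram))
```

With the counts quoted in the text, the rounded fractions reproduce the 12% and 16% given for the first release and for release 2.0.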

Figure 1.

Reutilization of protein complex subunits. The plot shows that most proteins (2038) are found in only one protein complex and only eight proteins are subunits of at least 30 protein complexes. Data for the analysis are based on the CORUM ‘Core Set’.

In addition to the complete dataset, CORUM now offers a reduced ‘Core Set’ for download and searches that avoids redundancies of data. Thoroughly investigated protein complexes like ‘SNARE complex (Vamp2, Snap25, Stx1a, Cplx1)’, ‘succinyl-CoA synthetase, ADP-forming’ and ‘cytochrome bc1-complex (EC 1.10.2.2), mitochondrial’ are characterized in more than one mammalian organism. Due to the close phylogenetic relationship between mammals it can be assumed that the majority of protein complexes are conserved in mammals. However, as the aim of CORUM is to provide a comprehensive dataset, evolutionarily conserved protein complexes from different organisms (interologous protein complexes) are also annotated in CORUM. To some extent this introduces redundancies, but on the other hand it confirms that the same protein complex in fact exists in different organisms. Another source of dataset expansion is results from several laboratories that investigated the same protein complex but characterized it with a different composition. Such differences may stem from experimental conditions of different stringency or from the different biomaterial that was used for the characterization. Bioinformatics applications like machine learning require non-redundant datasets. For these users we offer the ‘Core Set’ of 2084 distinct protein complexes. For this set only one representative of each group of interologous protein complexes or protein complex variants was selected. We chose protein complexes that were thoroughly characterized and preferably from Homo sapiens.
Annotation of protein complex subunits in CORUM is performed with UniProt identifiers.
Since some users prefer identifiers from Entrez Gene, we mapped the UniProt identifiers to the corresponding Entrez Gene identifiers. This was realized in a semi-automatic procedure using the CRONOS tool (23). CRONOS allows the mapping of identifiers, gene names and protein names from various resources like UniProt, RefSeq and Ensembl. In total, 4310 out of 4336 distinct subunits (98%) could be mapped to corresponding Entrez Gene identifiers. For 26 gene products like MRPS15 from Bos taurus or SPCS1 from Canis familiaris no respective identifier was available in Entrez.
CORUM is the only resource of protein complexes that includes functional annotation of the molecules. We use the FunCat annotation scheme for protein and protein complex function characterization (13). The FunCat has been used for genome annotation and was also frequently used for the analysis of protein networks and high-throughput experiments (13). The hierarchical structure of the FunCat allows browsing for protein complexes with particular cellular functions or localizations. In recent years, GO has become a widely used tool for the annotation of eukaryotic genomes (24). In contrast to the FunCat annotation scheme, the GO is constructed as a set of directed acyclic graphs, allowing more than one parent class per child (24). In order to enable bioinformatics analyses of protein complexes based on GO terms, the new CORUM release provides a mapping from FunCat to GO. The mapping was performed using the table that is available for download at http://www.geneontology.org/external2go/mips2go. As a result, 840 FunCat categories could be mapped to 896 GO terms. Manual inspection of 100 randomly chosen protein complexes revealed that FunCat categories and GO terms are in agreement.
Some valuable information concerning protein complexes cannot be covered by systematic annotation schemes but is represented as free-text comments in CORUM. This information includes protein complex composition (e.g. additional subunits of unknown identity), association of protein complexes with diseases or particular functional properties. In the first CORUM release this additional information was collected in a single comment field. In CORUM release 2.0 this content is now distributed among the three comment fields ‘Functional Comment’, ‘Disease Comment’ and ‘Subunit Comment’. This separation allows users to search within a particular type of information or, for instance, to use the wild card ‘_’ to retrieve all 223 protein complexes with information about disease association.
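The FunCat-to-GO mapping step described above can be sketched in Python as follows. This is a minimal illustration and not the procedure used by the CORUM curators. It assumes that the mapping table follows the usual external2go line layout (`source:id > GO:term name ; GO:id`, with comment lines starting with `!`). The sample lines are illustrative and were not copied from the mips2go file, and all function names are hypothetical.

```python
from collections import defaultdict


def parse_external2go(lines):
    """Parse external2go-style lines into {source_id: set of GO ids}.

    Assumed line layout: 'MIPS_funcat:01 > GO:term name ; GO:0008152'.
    Lines starting with '!' are treated as comments.
    """
    mapping = defaultdict(set)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue
        source, sep, target = line.partition(" > ")
        if not sep:
            continue  # no mapping on this line
        # Keep only the identifier, e.g. '01' from 'MIPS_funcat:01'.
        source_id = source.split()[0].split(":", 1)[-1]
        _name, sep, go_id = target.rpartition(" ; ")
        if sep:
            mapping[source_id].add(go_id.strip())
    return dict(mapping)


def go_terms_for_complex(funcat_ids, funcat_to_go):
    """Collect the GO ids for all FunCat categories of one complex."""
    terms = set()
    for funcat_id in funcat_ids:
        terms |= funcat_to_go.get(funcat_id, set())
    return terms


# Illustrative sample lines (assumed layout, not the real file content).
sample_lines = [
    "! illustrative header comment",
    "MIPS_funcat:01 > GO:metabolic process ; GO:0008152",
    "MIPS_funcat:12 > GO:translation ; GO:0006412",
]

funcat_to_go = parse_external2go(sample_lines)
complex_terms = go_terms_for_complex(["01", "12", "99"], funcat_to_go)
print(sorted(complex_terms))
```

Categories without an entry in the table, such as the made-up `99` here, simply contribute no GO term, which is in line with only a subset of FunCat categories being mappable.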

Phylogenetic analysis of protein complexes

Subunits of protein complexes like ribosomes and chaperonins are highly conserved in evolution. Besides ribosomal RNAs, subunits from complexes such as RNA polymerases (25) and F1-ATPases (26) were used in the early days of sequence-based phylogenetic analyses. Based on data from 191 sequenced genomes, 2 years ago a novel endeavor was started to investigate highly conserved proteins for phylogenetic analysis (27). The analysis revealed 31 highly conserved proteins that allow a new reconstruction of the tree of life, and 28 of these proteins are known to be protein complex subunits (23 ribosomal proteins). To enable scientists to obtain some insight into the phylogenetic conservation of subunits, the ‘Phylogenetic Conservation’ tool has been developed for comparative proteome analysis. The tool is based on sequence similarity data that are obtained from the SIMAP database (28). The Similarity Matrix of Proteins (SIMAP) database provides a comprehensive and up-to-date dataset of the pre-calculated sequence similarity matrix and sequence-based features like InterPro domains for all proteins contained in the major public sequence databases. The ‘Phylogenetic Conservation’ tool in CORUM presents the similarity of the protein complex subunits to proteins from other organisms as tables (Figure 2). By default, comparisons to 18 organisms are shown: four mammals (Homo sapiens, Mus musculus, Rattus norvegicus and Bos taurus), three other vertebrates (Xenopus laevis, Danio rerio and Takifugu rubripes), two invertebrates (Caenorhabditis elegans and Drosophila melanogaster), two plants (Arabidopsis thaliana and Oryza sativa), three fungi (Neurospora crassa, Schizosaccharomyces pombe and S. cerevisiae), one slime mold (Dictyostelium discoideum) and three prokaryotes (Thermoplasma acidophilum, Escherichia coli and Bacillus subtilis). In addition to the numerical values, the degree of protein sequence similarity is colour coded.
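The numerical values are described in the legend of Figure 2 as opt. score/self score ratios. The following Python sketch shows one way such a ratio table could be computed and binned into colour classes. The threshold values, class labels, scores and function names are assumptions made for illustration only. They are not the cut-offs used by CORUM or SIMAP.

```python
# Hypothetical colour classes; the real cut-offs are not given in the text.
ASSUMED_BINS = [(0.7, "high"), (0.4, "medium"), (0.1, "low")]


def similarity_ratio(opt_score, self_score):
    """Ratio of the best-hit alignment score to the subunit's self score."""
    if self_score <= 0:
        raise ValueError("self score must be positive")
    return opt_score / self_score


def conservation_class(ratio, bins=ASSUMED_BINS):
    """Assign a ratio to the first class whose threshold it reaches."""
    for threshold, label in bins:
        if ratio >= threshold:
            return label
    return "none"


def conservation_table(self_scores, best_hits):
    """Build {subunit: {organism: (ratio, class)}}.

    self_scores: subunit -> score of the subunit aligned to itself
    best_hits:   subunit -> organism -> best opt. score in that organism
    """
    table = {}
    for subunit, self_score in self_scores.items():
        row = {}
        for organism, opt_score in best_hits.get(subunit, {}).items():
            ratio = similarity_ratio(opt_score, self_score)
            row[organism] = (round(ratio, 2), conservation_class(ratio))
        table[subunit] = row
    return table


# Toy scores for a single made-up subunit.
self_scores = {"subunit_1": 1000.0}
best_hits = {
    "subunit_1": {
        "Mus musculus": 950.0,
        "Saccharomyces cerevisiae": 450.0,
        "Escherichia coli": 0.0,
    }
}

table = conservation_table(self_scores, best_hits)
for organism, cell in table["subunit_1"].items():
    print(organism, cell)
```

Normalizing by the self score puts subunits of different lengths on a common scale between 0 and 1, so that one colour scheme can be applied to a whole complex.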

Figure 2.

Phylogenetic conservation of proteasome regulatory protein complexes. Results of the ‘Phylogenetic Conservation’ tool from CORUM are shown for the three proteasome regulators ‘Modulator’, ‘PA28 gamma complex’ and ‘PA28 complex’. The similarity of protein complex subunits to proteins from other organisms is shown colour coded and as opt. score/self score ratios. The data are obtained from the SIMAP database.

Protein complexes appear to be conserved among phylogenetically related organisms and, depending on the respective complex, their conservation separates phylogenetically distant organisms. This can be illustrated with the proteasome and three proteasome activatory complexes. Two subunits of the ‘Modulator (PA700-dependent proteasome activator)’ are highly conserved (red colour) within all eukaryotes, whereas the ‘PA28 gamma complex’ is only highly conserved within vertebrates (Figure 2). Finally, high conservation of the ‘11 S REG complex’ is restricted to the four mammalian proteomes. The 20 S proteasome complex is a high-molecular-weight protease that is essential for protein degradation in mammals. Results of the ‘Phylogenetic Conservation’ tool reveal weak similarity of its subunits to proteins in the archaeon T. acidophilum (Supplementary Figure S1). In fact, an archetype of proteasomes, consisting of only two different subunits, is frequently found in archaea (29). On the other hand, sophisticated proteasome architectures like the 26 S proteasome or the availability of several proteasome activatory complexes are not found in Thermoplasma or other prokaryotes. In agreement with this observation, the three above-mentioned proteasome activators show no similarity to proteins from Thermoplasma (Figure 2). Results of the ‘Phylogenetic Conservation’ tool can be retrieved for single protein complexes or for multiple complexes that were found by one of the search options in CORUM.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

ERA-NET PathoGenoMics ‘Pathomics’ grant (BMBF) (to B.W.). Funding to open access charge: Helmholtz Center Munich (Helmholtz Zentrum München).

Conflict of interest statement. None declared.

29 in total

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. BIND: the Biomolecular Interaction Network Database.

Authors: Gary D Bader; Doron Betel; Christopher W V Hogue
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

3. Evidence for dynamically organized modularity in the yeast protein-protein interaction network.

Authors: Jing-Dong J Han; Nicolas Bertin; Tong Hao; Debra S Goldberg; Gabriel F Berriz; Lan V Zhang; Denis Dupuy; Albertha J M Walhout; Michael E Cusick; Frederick P Roth; Marc Vidal
Journal: Nature Date: 2004-06-09 Impact factor: 49.962

4. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes.

Authors: Andreas Ruepp; Alfred Zollner; Dieter Maier; Kaj Albermann; Jean Hani; Martin Mokrejs; Igor Tetko; Ulrich Güldener; Gertrud Mannhaupt; Martin Münsterkötter; H Werner Mewes
Journal: Nucleic Acids Res Date: 2004-10-14 Impact factor: 16.971

5. The cell as a collection of protein machines: preparing the next generation of molecular biologists.

Authors: B Alberts
Journal: Cell Date: 1998-02-06 Impact factor: 41.582

Review 6. Eubacterial proteasomes.

Authors: A Lupas; F Zühl; T Tamura; S Wolf; I Nagy; R De Mot; W Baumeister
Journal: Mol Biol Rep Date: 1997-03 Impact factor: 2.316

7. Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes.

Authors: N Iwabe; K Kuma; M Hasegawa; S Osawa; T Miyata
Journal: Proc Natl Acad Sci U S A Date: 1989-12 Impact factor: 11.205

8. Archaebacterial DNA-dependent RNA polymerases testify to the evolution of the eukaryotic nuclear genome.

Authors: G Pühler; H Leffers; F Gropp; P Palm; H P Klenk; F Lottspeich; R A Garrett; W Zillig
Journal: Proc Natl Acad Sci U S A Date: 1989-06 Impact factor: 11.205

9. SIMAP: the similarity matrix of proteins.

Authors: Thomas Rattei; Roland Arnold; Patrick Tischler; Dominik Lindner; Volker Stümpflen; H Werner Mewes
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

10. CYGD: the Comprehensive Yeast Genome Database.

Authors: U Güldener; M Münsterkötter; G Kastenmüller; N Strack; J van Helden; C Lemer; J Richelles; S J Wodak; J García-Martínez; J E Pérez-Ortín; H Michael; A Kaps; E Talla; B Dujon; B André; J L Souciet; J De Montigny; E Bon; C Gaillardin; H W Mewes
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
