CellCircuits: a database of protein network models.

H Craig Mak, Mike Daly, Bianca Gruebel, Trey Ideker.

Abstract

CellCircuits (http://www.cellcircuits.org) is an open-access database of molecular network models, designed to bridge the gap between databases of individual pairwise molecular interactions and databases of validated pathways. CellCircuits captures the output from an increasing number of approaches that screen molecular interaction networks to identify functional subnetworks, based on their correspondence with expression or phenotypic data, their internal structure or their conservation across species. This initial release catalogs 2019 computationally derived models drawn from 11 journal articles and spanning five organisms (yeast, worm, fly, Plasmodium falciparum and human). Models are available either as images or in machine-readable formats and can be queried by the names of proteins they contain or by their enriched biological functions. We envision CellCircuits as a clearinghouse in which theorists may distribute or revise models in need of validation and experimentalists may search for models or specific hypotheses relevant to their interests. We demonstrate how such a repository of network models is a novel systems biology resource by performing several meta-analyses not currently possible with existing databases.

Year: 2006 PMID: 17135207 PMCID: PMC1751555 DOI: 10.1093/nar/gkl937

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

At present, a great deal of biological information is represented as interactions between molecules. This information includes physical interactions that occur among proteins, DNA, RNA and small molecules (1–3); genetic interactions such as synthetic lethality, enhancement or suppression (4); and interactions due to co-expression (5) or co-citation (6). Modern analyses of interaction data typically accomplish two goals. The first goal is to clean the data, by filtering erroneous interactions that can be associated with high-throughput screens [false positives, e.g. (7,8)] or by predicting new interactions that may have been previously missed [false negatives, e.g. (9,10)]. The second goal is to organize the interactions into biological network models, i.e. collections of interactions hypothesized to work together towards a particular cellular function or within a common pathway (11–13).

Interaction analysis is currently supported by two types of available databases (Figure 1). First, the raw material for analysis is provided by databases of molecular interactions including the Database of Interacting Proteins (14), the Munich Information Center for Protein Sequences (15), the Biomolecular Interaction Network Database (16), the BioGRID (17) and IntAct (18). Many of these databases provide confidence scores with each measured and predicted interaction. Second, there are a growing number of so-called pathway databases, in which canonical diagrams of metabolic, signaling or regulatory pathways have been hand-curated from review articles and textbooks. Metabolic pathways are the focus of Reactome (19), MetaCyc (20) and the Kyoto Encyclopedia of Genes and Genomes (21), while databases such as BioCarta, CellMap, the Signal Transduction Knowledge Environment (22), GeNet (23) and TransPATH (24) are primarily concerned with signaling and transcription. All of these pathway databases are relevant to the second and perhaps ultimate goal of interaction analysis: models of well-defined and well-validated functional relationships among genes, proteins and/or metabolites.

Figure 1

The need for a new type of database. The CellCircuits database is positioned between raw molecular interaction databases (left) and databases of rigorously validated cellular pathways (right). Interaction database icons represent (clockwise from top left) the Database of Interacting Proteins [DIP (14)]; the General Repository of Interaction Datasets [GRID (17)]; Molecular INTeractions Database [MINT (48)]; the IntAct molecular interactions database (18); the interaction database at the Munich Information Center for Protein Sequences [MIPS (15)]; and Biomolecular Interaction Network Database [BIND (16)]. Pathway database icons represent Reactome (19); Signal Transduction Knowledge Environment [STKE (22)]; Gene Networks database [GeNet (23)]; BioCarta; Kyoto Encyclopedia of Genes and Genomes [KEGG (21)]; and CellMap.

Automatic inference of accurate and detailed molecular pathways, however, is well beyond the capability of current interaction analyses and integrative modeling approaches. Although current approaches attempt to place interactions into subnetworks according to their putative function (11–13), such subnetworks are hypothetical in nature and thus inappropriate for entry into any of the existing databases of canonical pathways. Rather, the subnetwork models produced by automated approaches are typically embedded in figures, tables or supplementary information in the primary published literature. Although it is certainly possible to read about the models, there are several problems with this traditional method of dissemination. First, the size and number of models from even a single publication can be overwhelming, making models relevant to a particular gene or function difficult to locate. Second, in many cases, network modeling papers target bioinformatic, rather than biological or medical, audiences. As a result, the models remain largely inaccessible to those who have the most knowledge to interpret them and the most to gain from their successful interpretation.

Recent opinion articles (25,26) have recognized a related problem for the case of protein functional predictions, calling for a clearinghouse of hypotheses generated by bioinformatics analyses and searchable by experimental biologists. In the same vein, the BioModels Database (27) has recently been adopted as a working repository for simulations of kinetic quantitative systems based on ordinary differential equations. Subnetworks inferred from genome-scale data, however, do not fall into this category.

Motivated by these considerations, we have designed CellCircuits as an open-access general repository of models distilled from protein networks. By aggregating models derived from many separate studies into a single resource, CellCircuits bridges the gap between databases of individual pairwise interactions and fully curated, biologically validated pathway models. The CellCircuits database enables experimentalists to readily access and cross-reference models across multiple publications. It also enables the meta-analysis of the entire set of models to reveal inter-model relationships and to answer global questions; for instance, which models overlap in terms of the genes and/or cellular processes represented? How novel is a new result given the models that are already present in the database?

MATERIALS AND METHODS

Data processing

A data processing pipeline was used to extract information from the textual representation of a model and store that information in a MySQL relational database. The data processing pipeline required a digital image of each model and a text file containing the genes, proteins, metabolites, other small molecules and interconnections represented in the model. In cases when a network model was published in graphical form only, the text file was manually transcribed (see Supplementary Table S1). To ensure that the CellCircuits database used a consistent set of gene identifiers, we mapped each gene name found in the text file for a model to a Gene Ontology (GO) gene ID using tables from the GO database. Gene names found in a model but not in the GO database were automatically inserted into the appropriate database tables and flagged as being externally added. Future curation efforts could be directed towards handling these genes missing from the GO database. After models were entered into the database, they were scored using the hypergeometric test for GO annotation enrichment.
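The identifier-mapping step can be illustrated with a minimal sketch. It assumes the GO gene table has already been loaded into a plain dictionary; the function name `map_gene_names`, the upper-case normalization and the numeric ids are hypothetical and are not taken from the CellCircuits pipeline.

```python
def map_gene_names(model_genes, go_gene_ids):
    """Split a model's gene names into mapped ids and names to flag.

    go_gene_ids is an assumed in-memory lookup (gene name -> GO gene id).
    Names absent from the lookup are returned separately so that they can
    be inserted and flagged as externally added.
    """
    mapped = {}
    externally_added = []
    for name in model_genes:
        key = name.strip().upper()  # assumption: names compared case-insensitively
        if key in go_gene_ids:
            mapped[name] = go_gene_ids[key]
        else:
            externally_added.append(name)
    return mapped, externally_added


# Made-up ids, for illustration only.
go_gene_ids = {"RAD9": 101, "RAD53": 102}
mapped, flagged = map_gene_names(["RAD9", "rad53", "YFG1"], go_gene_ids)
```

Here `flagged` holds the names that would need the "externally added" marker described above.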

Web interface

We used Perl CGI scripts in conjunction with the Apache web server, mod_perl and Perl DBI to serve HTML content, handle user input and query the MySQL database. Script.aculo.us version 1.61, an open source JavaScript library, was used to generate visual effects on the web pages that display search results.

Scoring models for Gene Ontology annotation

Using the latest release of the GO database, models were scored for a statistically significant number of genes in the model that were annotated with a particular GO term. We first identified the complete set of genes associated with each GO term. This set included the genes directly annotated with that term as well as those annotated with any of the term's descendants in the GO hierarchy. Next, we used the hypergeometric distribution (28,29) to test the genes in each model against the genes annotated with each of the GO terms. The resulting P-values were stored in the database.
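As a rough illustration of this scoring step, the sketch below computes an upper-tail hypergeometric P-value with the Python standard library. The text states only that the hypergeometric distribution was used; the choice of the upper tail (the probability of seeing at least the observed number of annotated genes), the function name and the example counts are assumptions.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, model_size, overlap):
    """Return P(X >= overlap) for a hypergeometric draw.

    population: number of genes in the background set
    annotated:  background genes carrying the GO term (including descendants)
    model_size: number of genes in the network model
    overlap:    model genes that carry the GO term
    """
    upper = min(annotated, model_size)
    total = comb(population, model_size)
    # comb() returns 0 when k exceeds n, so impossible draws contribute nothing.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, model_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 18-gene model, 5 genes annotated, 40 annotated genes genome-wide.
p_value = hypergeom_enrichment_p(population=6000, annotated=40, model_size=18, overlap=5)
```

In practice one P-value of this kind would be computed for every model and GO term pair and stored alongside the model.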

Scoring similarity between publications

For each pair of publications we compared all models in one publication to all of the models in the other. To capture model similarity as sensitively as possible, we defined two models to be similar if they shared at least one interaction. The similarity score of a pair of publications was defined to be the number of distinct models that participated in any overlap divided by the total number of models in the pair. For example, consider publication A containing two models and publication B containing six models. If model 1 in A overlaps with models 1–5 in B, and model 2 in A only overlaps with model 1 in B, then the total number of distinct overlapping models is 7, and the similarity score between publications is 7/8.
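The worked example above can be reproduced with a short sketch. Representing each model as a set of undirected interactions and the names `edge` and `publication_similarity` are assumptions made for illustration; the protein labels are placeholders.

```python
def edge(a, b):
    """Undirected interaction between two proteins."""
    return frozenset((a, b))


def publication_similarity(models_a, models_b):
    """Distinct models taking part in any overlap / total models in the pair.

    models_a, models_b: dict mapping model id -> set of interactions.
    Two models overlap if they share at least one interaction.
    """
    overlapping_a, overlapping_b = set(), set()
    for id_a, edges_a in models_a.items():
        for id_b, edges_b in models_b.items():
            if edges_a & edges_b:
                overlapping_a.add(id_a)
                overlapping_b.add(id_b)
    total = len(models_a) + len(models_b)
    return (len(overlapping_a) + len(overlapping_b)) / total if total else 0.0


# Seven placeholder interactions P1-P2, P2-P3, ..., P7-P8.
e = [edge(f"P{i}", f"P{i + 1}") for i in range(1, 8)]

# Publication A: model A1 overlaps B1-B5, model A2 overlaps only B1.
pub_a = {"A1": set(e[0:5]), "A2": {e[6]}}
pub_b = {
    "B1": {e[0], e[6]}, "B2": {e[1]}, "B3": {e[2]},
    "B4": {e[3]}, "B5": {e[4]}, "B6": {e[5]},
}

score = publication_similarity(pub_a, pub_b)  # 7 overlapping models out of 8
```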

RESULTS

A spectrum of network models

To date, interactions have been organized by searching for essentially three types of subnetworks: linear paths of interactions, interaction clusters or parallel clusters. Representative models of each type are shown in Figure 2. Linear (or branching) paths of interactions have been used to represent biological pathways such as metabolic processes or regulatory cascades (Figure 2a) (30–32). Clusters in an interaction network are regions of dense interconnections and are suggestive of functional protein complexes (Figure 2b) (33–37). Parallel clusters are two (or more) similar network clusters in which the proteins in one cluster are, in some way, associated with the proteins in the other cluster. Parallel clusters have been used to represent protein complexes conserved across species (Figure 2c) (38–40), in which pairs of proteins spanning the two clusters are orthologs associated by sequence-similarity relationships. They have also been used to identify the physical basis for genetic interactions (Figure 2d) (41), in which two protein interaction clusters are linked by many genetic interactions if the clusters perform redundant or synergistic cellular functions.

Figure 2

Representative network models stored in CellCircuits. (a) A collection of linear regulatory pathways downstream of mating-type locus in yeast (31). (b) An interaction cluster of co-expressed proteins suggestive of a functional complex (34). (c) Parallel clusters conserved between P.falciparum and yeast (40). (d) Parallel clusters that are highly connected by genetic interactions (41).

Finally, integrating the interaction network with external data, such as gene expression profiles and other molecular states, has also been a key methodology used to identify significant subnetworks. For instance, these approaches have been used to find protein interaction clusters that exhibit coherent expression changes in response to panels of perturbations (33,35,36) or as a function of the cell cycle (34). Other works (42) have reported network ‘motifs’, defined as patterns of interactions that occur more often in the network than expected by chance. However, these approaches (by design) focus on general patterns rather than subnetworks of particular proteins. Therefore, they are not considered here.

Database coverage and assembly

This CellCircuits initial release (version 1.0) was designed as proof-of-principle of the value of a searchable database of network models. We focused on providing a clear database interface and representative, albeit incomplete, coverage of the types of network models possible. For version 1.0, the database includes models from 11 publications, spanning linear, clustered or parallel subnetworks, with priority given to publications with models available in both graphical representations and machine-readable formats (Table 1). Graphical representations of network models are a particularly valuable method of disseminating interactions and/or pathways, in much the same way that DNA sequence logos (43) are used to visualize position-specific score matrices of DNA-binding motifs. Conversely, machine-readable formats, such as SBML (44), BioPAX (45) or the Cytoscape SIF format (46), greatly facilitate database entry, model curation and subsequent computational analysis. Four publications provided models in both graphical and machine-readable formats (32,39–41). For the remaining seven, models were manually curated from published figures (30,31,33–36,38).
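To show why a machine-readable format eases database entry, here is a minimal sketch of reading a model in the Cytoscape SIF layout, where each line holds a source node, an interaction type and one or more target nodes, and a line with a single name is an isolated node. The function name `parse_sif` and the example genes are illustrative and do not come from the CellCircuits pipeline.

```python
def parse_sif(text):
    """Parse SIF text into a set of nodes and a set of typed interactions."""
    nodes, interactions = set(), set()
    for line in text.splitlines():
        # SIF fields are tab-delimited if the line contains a tab, else whitespace-delimited.
        fields = line.split("\t") if "\t" in line else line.split()
        fields = [f.strip() for f in fields if f.strip()]
        if not fields:
            continue
        source = fields[0]
        nodes.add(source)
        if len(fields) >= 3:
            kind = fields[1]
            for target in fields[2:]:
                nodes.add(target)
                interactions.add((source, kind, target))
    return nodes, interactions


# Illustrative content: "pp" and "pd" are commonly used labels for
# protein-protein and protein-DNA interactions.
sif_text = "RAD9 pp RAD53 MEC1\nRAD53 pd RNR1\nYFG1\n"
nodes, interactions = parse_sif(sif_text)
```

A figure-only model, by contrast, has to be transcribed by hand into such a list before it can be loaded.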

Table 1

Sources of data

aY = Yeast; W = Worm; F = Fly; H = Human; P = P.falciparum.

bCounts refer to total number of models across all organisms modeled.

cCounts refer to number of distinct genes in yeast only across all models.

dCounts refer to number of distinct interactions in yeast only across all models.

eFor gene expression, counts refer to number of profiles used.

Shading indicates which publications utilize particular types of Interaction data, State data, or Network patterns.

Manual curation involved downloading figures containing each network model, and then transcribing the genes and interactions in the models into a machine-readable format. For most publications, one figure, or each subpanel in a figure, contained a single network model. However, in three publications (31,34,38) the figures contained multiple, unconnected networks that were not divided by the authors into separate subpanels. In these cases, each unconnected component was entered as one model in CellCircuits, and in one case, networks were further subdivided into smaller models if they contained several sparsely connected, but functionally annotated, clusters of proteins (see Supplementary Table S1). These curation activities resulted in a total of 2019 protein network models in the database.

Models in the database include protein interactions from five organisms: yeast (Saccharomyces cerevisiae; 91% of all models), fly (Drosophila melanogaster; 58%), nematode worm (Caenorhabditis elegans; 27%), a malarial parasite (Plasmodium falciparum; 2%) and human (2%; these percentages total >100% due to cross-species comparisons covering multiple species in a single model). The models include up to four types of interactions (protein–protein, protein–DNA, genetic and metabolic) as well as two types of external data (gene expression and gene deletion phenotypes).

Network model query

Models in the CellCircuits database are queried through a web-based interface. In the simplest use case, entering a standard gene name (e.g. RAD9) into the search field will return all models containing that gene. Wild-card searches are permitted (e.g. RAD* will search for models containing any gene with a name that begins with the letters RAD, see Figure 3). All gene queries are also checked against a list of gene name synonyms, which are drawn from the latest release of the GO database (47). In addition, searches can be limited to models from specific publications or to models containing genes from specific organisms.

Figure 3

Web interface. Results using RAD* and ‘DNA binding’ as the search query (circle 1). A total of 274 subnetwork models are returned. The search output includes a graphical representation of the model (circle 7), the genes and GO terms from the model that match the query (circle 6), alternative gene names or synonyms matching the query (circle 9), the total number of matches (circle 8), enriched GO terms (circle 5 and 3), a link to view similar models (circle 4) and a link to example search queries (circle 2).

Searches based on gene function are also supported. The CellCircuits database automatically scores all models for GO functional enrichment using the hypergeometric test (see Materials and Methods). Such tests had been originally applied in only 3 out of the 11 curated publications. The enrichment results are stored with each model in the database as meta-data, allowing users to search for models that are enriched for genes having a particular annotation. For example, some of the same models retrieved by searching for RAD9 can also be retrieved by searching for GO annotations associated with this gene. Queries may include exact GO ID numbers (e.g. GO:0006974) or partial or complete GO term names (e.g. ‘DNA damage’ or ‘integrity checkpoint’; these must be enclosed in double quotes). More than one gene, GO annotation or wild-card may be included in a query. If a model matches multiple search terms, it will be ranked higher in the results. All search results include graphical representations of the models, links to the original publication, the organism(s) modeled, the genes or GO annotations from the search query that were found in each model and the hypergeometric P-value of enrichment for any GO annotations (Figure 3).
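The query behavior described here (wild-card gene names, exact GO IDs, double-quoted GO term names, and higher rank for models matching more terms) can be sketched as follows. This is not the CellCircuits Perl implementation; the function names, the in-memory model records and the GO term names attached to them are assumptions for illustration.

```python
import re
from fnmatch import fnmatchcase


def parse_query(query):
    """Split a query into (kind, value) terms; quoted phrases are GO term names."""
    terms = []
    for quoted, bare in re.findall(r'"([^"]+)"|(\S+)', query):
        if quoted:
            terms.append(("go_name", quoted.lower()))
        elif bare.upper().startswith("GO:"):
            terms.append(("go_id", bare.upper()))
        else:
            terms.append(("gene", bare.upper()))  # may contain the * wild-card
    return terms


def count_matches(terms, genes, go_terms):
    """Count query terms matched by a model (go_terms: GO id -> term name)."""
    genes_upper = [g.upper() for g in genes]
    hits = 0
    for kind, value in terms:
        if kind == "gene":
            hits += any(fnmatchcase(g, value) for g in genes_upper)
        elif kind == "go_id":
            hits += value in go_terms
        else:
            hits += any(value in name.lower() for name in go_terms.values())
    return hits


# Hypothetical model records; the GO term names are illustrative labels.
models = {
    "m1": (["RAD9", "RAD53"], {"GO:0006974": "response to DNA damage stimulus"}),
    "m2": (["CDC28"], {"GO:0003677": "DNA binding"}),
}
terms = parse_query('RAD* "DNA binding" GO:0006974')
ranked = sorted(models, key=lambda m: count_matches(terms, *models[m]), reverse=True)
```

Sorting by the number of matched terms mirrors the stated ranking rule; a synonym lookup for gene names could be added before the wild-card match.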

Meta-analysis of models

Collecting published network models within a single database allowed us to survey the state of computational analysis of large interaction datasets. Scoring all models for GO functional enrichment (described in the previous section) is an example of such analyses. Another example, the observed sizes of models from all 11 publications, is shown in Figure 4a. On average, the 2019 models in the database contained ∼18 proteins and 36 interactions with 95% of models containing between 5 and 30 proteins. However, this distribution was heavily influenced by two publications (39,41) which together contributed over 90% of the models in the database.

Figure 4

Meta-analysis of models. (a) Histogram of the number of genes or proteins per model. (b) Histogram of the number of genes (y-axis) that are contained in a given number of models (x-axis). The inset is an expanded view of the genes that span over 50 models. (c) Overlap between network modeling publications. Thicker lines represent greater similarity between the sets of models published in two publications (see legend). Similarity is measured by the number of distinct models that share one or more interactions (yeast interactions only) divided by the total number of models in both publications. Interactions are shared between almost every pair of publications, but for clarity, similarity scores <0.05 are not shown.

To assess the overlap between models, we examined the extent to which the same proteins appeared in multiple models (Figure 4b). Although a protein was shared by approximately nine models on average, the majority were found in only one or two models. Thirty-five proteins appeared in over 100 models (<5% of all models in the database). Interestingly, among these were all six of the yeast ATPases in the 26S proteasome (RPT1–6), components of the yeast and worm 20S proteasome, and several yeast, worm and fly protein kinases. The pervasiveness of these proteins in models may reflect their broad evolutionary conservation across species, a high degree of connectivity in the protein network, their popularity in the biological literature or their functional roles in many distinct biological processes (i.e. pleiotropy).

The results of our model overlap analyses are accessible through the web interface. Each model is annotated in the CellCircuits database with a list of similar models, defined as those that contain at least three of the same genes. Clicking the ‘View similar models’ link in the search results will display these models (Figure 3, circle 4). Currently, only the number of shared genes is used to assess similarity between models. However, more complex measures could be envisioned, potentially making CellCircuits, itself, a resource for comparing several similar models (perhaps corresponding to the same biological process) and showing the differences between them.

On a broader scale, we also assessed the extent to which publications covered overlapping regions of the protein interactome using a pairwise similarity score (see Materials and Methods). Results are shown in Figure 4c. Although our similarity score was permissive such that some overlap was expected between every pair of publications, only 5 out of the 55 possible pairs showed over 25% similarity. Thus, it appears that the different modeling publications are, to some degree, capturing different regions of the protein interaction network [excluding (39,41), see Figure 4c]. Furthermore, in the future, this kind of meta-analysis could be used to determine how the results from new publications differ from existing models.
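The gene-based annotation of similar models (those containing at least three of the same genes) could be computed along the following lines. This is a sketch under assumptions: models are held as in-memory sets of gene names, and the function name `similar_models` and the example gene sets are hypothetical.

```python
from itertools import combinations


def similar_models(models, min_shared=3):
    """Map each model id to the ids of models sharing >= min_shared genes.

    models: dict mapping model id -> set of gene names.
    """
    similar = {model_id: [] for model_id in models}
    for (id_a, genes_a), (id_b, genes_b) in combinations(models.items(), 2):
        if len(genes_a & genes_b) >= min_shared:
            similar[id_a].append(id_b)
            similar[id_b].append(id_a)
    return similar


# Illustrative gene sets only.
example_models = {
    "m1": {"RPT1", "RPT2", "RPT3", "RPT4"},
    "m2": {"RPT1", "RPT2", "RPT3", "PRE1"},
    "m3": {"RPT1", "CDC28"},
}
similar = similar_models(example_models)
```

The all-pairs comparison is quadratic in the number of models, which is manageable for a few thousand models; an index from gene to models would reduce the work for larger collections.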

DISCUSSION

In summary, CellCircuits version 1.0 provides a clearinghouse in which hypothetical pathway models derived from large-scale protein networks may be easily accessed, queried and exported for further study. The 11 publications included in this initial release were chosen to cover a broad range of network model types with a bias towards publications that provided models in both graphical and machine-readable format. Beyond this proof-of-principle, a significant question is whether, or to what extent, all past and future network models might be incorporated. On one hand, the field of network biology is still young such that the number of relevant previous publications is probably <50. On the other hand, the rapid adoption of systems and network approaches will make capturing information from all future works a daunting prospect if the models are not readily accessible. CellCircuits complements existing efforts that have begun to address this challenge, such as markup languages for describing models [BioPAX (45) and SBML (44)] and the BioModels Database of quantitative, kinetic models (27). Similar to biological sequence and microarray databases, we envision CellCircuits as a valuable resource for storing, accessing and updating network models across the wider biological research community.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

46 in total

Review 1. Global synthetic-lethality analysis and yeast functional profiling.

Authors: Siew Loon Ooi; Xuewen Pan; Brian D Peyser; Ping Ye; Pamela B Meluh; Daniel S Yuan; Rafael A Irizarry; Joel S Bader; Forrest A Spencer; Jef D Boeke
Journal: Trends Genet Date: 2005-11-23 Impact factor: 11.639

2. The Plasmodium protein network diverges from those of other eukaryotes.

Authors: Silpa Suthram; Taylor Sittler; Trey Ideker
Journal: Nature Date: 2005-11-03 Impact factor: 49.962

3. From genomics to chemical genomics: new developments in KEGG.

Authors: Minoru Kanehisa; Susumu Goto; Masahiro Hattori; Kiyoko F Aoki-Kinoshita; Masumi Itoh; Shuichi Kawashima; Toshiaki Katayama; Michihiro Araki; Mika Hirakawa
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

4. BioModels Database: a free, centralized database of curated, published, quantitative kinetic models of biochemical and cellular systems.

Authors: Nicolas Le Novère; Benjamin Bornstein; Alexander Broicher; Mélanie Courtot; Marco Donizelli; Harish Dharuri; Lu Li; Herbert Sauro; Maria Schilstra; Bruce Shapiro; Jacky L Snoep; Michael Hucka
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

5. TRANSPATH: an information resource for storing and visualizing signaling pathways and their pathological aberrations.

Authors: Mathias Krull; Susanne Pistor; Nico Voss; Alexander Kel; Ingmar Reuter; Deborah Kronenberg; Holger Michael; Knut Schwarzer; Anatolij Potapov; Claudia Choi; Olga Kel-Margoulis; Edgar Wingender
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

6. MetaCyc: a multiorganism database of metabolic pathways and enzymes.

Authors: Ron Caspi; Hartmut Foerster; Carol A Fulcher; Rebecca Hopkinson; John Ingraham; Pallavi Kaipa; Markus Krummenacker; Suzanne Paley; John Pick; Seung Y Rhee; Christophe Tissier; Peifen Zhang; Peter D Karp
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

7. BioGRID: a general repository for interaction datasets.

Authors: Chris Stark; Bobby-Joe Breitkreutz; Teresa Reguly; Lorrie Boucher; Ashton Breitkreutz; Mike Tyers
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

8. MIPS: analysis and annotation of proteins from whole genomes in 2005.

Authors: H W Mewes; D Frishman; K F X Mayer; M Münsterkötter; O Noubibou; P Pagel; T Rattei; M Oesterheld; A Ruepp; V Stümpflen
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

9. Text-mining and information-retrieval services for molecular biology.

Authors: Martin Krallinger; Alfonso Valencia
Journal: Genome Biol Date: 2005-06-28 Impact factor: 13.583

10. Validation and refinement of gene-regulatory pathways on a network of physical interactions.

Authors: Chen-Hsiang Yeang; H Craig Mak; Scott McCuine; Christopher Workman; Tommi Jaakkola; Trey Ideker
Journal: Genome Biol Date: 2005-07-01 Impact factor: 13.583
