
Michigan molecular interactions r2: from interacting proteins to pathways.

V Glenn Tarcea¹, Terry Weymouth, Alex Ade, Aaron Bookvich, Jing Gao, Vasudeva Mahavisno, Zach Wright, Adriane Chapman, Magesh Jayapandian, Arzucan Ozgür, Yuanyuan Tian, Jim Cavalcoli, Barbara Mirel, Jignesh Patel, Dragomir Radev, Brian Athey, David States, H V Jagadish.

Abstract

Molecular interaction data exists in a number of repositories, each with its own data format, molecule identifier and information coverage. Michigan molecular interactions (MiMI) assists scientists searching through this profusion of molecular interaction data. The original release of MiMI gathered data from well-known protein interaction databases, and deep-merged this information while keeping track of provenance. Based on the feedback received from users, MiMI has been completely redesigned. This article describes the resulting MiMI Release 2 (MiMIr2). New functionality includes extension from proteins to genes and to pathways; identification of highlighted sentences in source publications; seamless two-way linkage with Cytoscape; query facilities based on MeSH/GO terms and other concepts; approximate graph matching to find relevant pathways; support for querying in bulk; and a user focus-group driven interface design. MiMI is part of the NIH's National Center for Integrative Biomedical Informatics (NCIBI) and is publicly available at: http://mimi.ncibi.org.

Substances：
Proteins

Year: 2008 PMID: 18978014 PMCID: PMC2686565 DOI: 10.1093/nar/gkn722

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Both the volume and number of data sources in molecular biology are increasing rapidly. Often multiple resources provide overlapping, partial and polymorphic views of the same data. Scientists often wish to piece together data from multiple source databases. Many web sites today recognize this need and provide one-stop access to multiple external databases. These include (1–4) to name a few. The IMEx consortium (http://imex.sf.net/) facilitates sharing of interaction information across multiple databases maintained by independent curators. However, all of these databases perform ‘shallow integration’: they make access convenient for the user by making all data available at one place, but make no attempt to pull that data into a cohesive whole. Source databases have overlapping coverage, resulting in multiple entries of the same information in the integrated result. For example, the same source publication may be cited in the interaction record in DIP, BIND, HPRD and BioGRID. Michigan molecular interactions (MiMI) helps scientists search through large quantities of information by integrating all information from participating data sources through the process of deep merging. As a result, redundant data are removed and related data are combined. Furthermore, in doing so, MiMI keeps track of the ‘provenance’ of each piece of information, that is, where it was obtained. A website with this functionality was launched about 2 years ago, and described in (5). Since then, we have received a great deal of feedback, and have made further progress in integrating information. A completely redone MiMI Release 2 (MiMIr2) is now being released, and is the subject of this article. Noteworthy new features are mentioned below. MiMI is a component of the NIH's National Center for Integrative Biomedical Informatics (http://www.ncibi.org), and is available at: http://mimi.ncibi.org free of charge, and with no registration required. A dump of MiMIr2 data is also available for free, but comes with some limitations on use and requires a license agreement.
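
The contrast between shallow integration and deep merging can be illustrated with a small sketch. It is not MiMI's actual implementation: the record layout, the identifiers (`P1`, `P2`, ...) and the function name `deep_merge` are assumptions made for illustration. It only shows the idea of collapsing duplicate interaction records from several sources while remembering which source supplied each citation.

```python
from collections import defaultdict

# Hypothetical records: (source database, molecule A, molecule B, PubMed ID).
# Identifiers and values are illustrative only, not real MiMI data.
RECORDS = [
    ("DIP", "P1", "P2", "100"),
    ("BIND", "P2", "P1", "100"),
    ("HPRD", "P1", "P2", "200"),
    ("BioGRID", "P3", "P1", "300"),
]


def deep_merge(records):
    """Collapse duplicate interactions and keep provenance for each fact."""
    merged = defaultdict(
        lambda: {"sources": set(), "pubmed_ids": defaultdict(set)}
    )
    for source, mol_a, mol_b, pmid in records:
        # An undirected interaction is the same whichever partner comes first.
        key = tuple(sorted((mol_a, mol_b)))
        entry = merged[key]
        entry["sources"].add(source)
        # Provenance: remember which source contributed each citation.
        entry["pubmed_ids"][pmid].add(source)
    return merged


merged = deep_merge(RECORDS)
for pair, entry in sorted(merged.items()):
    citations = {pmid: sorted(srcs) for pmid, srcs in entry["pubmed_ids"].items()}
    print(pair, sorted(entry["sources"]), citations)
```

In this toy input, four source records reduce to two merged interactions, and the citation shared by DIP and BIND is stored once with both sources attached.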

BIOLOGICAL CONCEPTS

A central need for many scientists is to place a protein interaction in context, either in terms of genes that code for these proteins or in terms of pathways of which this interaction may be a part. MiMIr2 facilitates this by integrating relevant data from NCBI Entrez Gene on the one hand, and Reactome and KEGG on the other; and also by providing novel graph browsing and graph searching capabilities (described in the next section). MiMIr1 had information primarily regarding protein interactions, designed for scientists to query based on proteins of interest. We found, however, that most users had genes rather than proteins of interest. We also found that many users were cursory in mapping from genes of interest to proteins of interest: it is tempting to take the known gene name, assume that it codes for exactly one protein, and assume that the protein it codes for has the same name as the gene. We all know this is not always the case, but we do it nonetheless, getting results that are less than satisfactory. MiMIr2 remains an interaction database, but uses genes rather than proteins as the central identifying entity. Users ask about relationships between genes, and are informed about interacting products of these genes. One important purpose of looking at protein (or gene) interactions is to determine biological pathways of interest in which the subject genes play a role. MiMIr2 has imported pathways from KEGG (6) and Reactome (7). For each gene and for each interaction, MiMIr2 can be used to find pathways in which it participates.

IDENTITY

The issue of identity is determining when two database entries refer to the same real world object. If two proteins have an almost identical sequence of amino acids and are expressed in the same organism, then they are likely to be the same. MiMIr1 included many rules, such as this, to figure out when two databases referred to the same protein. While this worked well for the most part, we found many protein fragments, and other variants, were recorded as separate molecules. When a scientist queried for a particular protein, these variants would not be returned, since they were considered distinct entities. Given the introduction of gene as the primary query entity in MiMIr2, we had a natural way to address this vexing identity question by mapping proteins to their coding genes. Now, two variants, even if treated as distinct proteins, can still be associated with the same gene. When a user queries for interactions associated with this gene, all proteins and fragments, and their interactions, are returned.
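
A minimal sketch of the gene-centred lookup described above, assuming a simple protein-to-gene table. The identifiers, the two data structures and the function `interactions_for_gene` are hypothetical and stand in for whatever mapping MiMIr2 uses internally.

```python
# Hypothetical identifiers; the mapping and interactions are illustrative only.
PROTEIN_TO_GENE = {
    "PROT_A_full": "GENE_A",
    "PROT_A_fragment": "GENE_A",  # a fragment recorded as a separate molecule
    "PROT_B": "GENE_B",
    "PROT_C": "GENE_C",
}

PROTEIN_INTERACTIONS = [
    ("PROT_A_full", "PROT_B"),
    ("PROT_A_fragment", "PROT_C"),
    ("PROT_B", "PROT_C"),
]


def interactions_for_gene(gene):
    """Return (protein, partner, partner gene) for every product of `gene`."""
    results = []
    for left, right in PROTEIN_INTERACTIONS:
        # Check both orientations, since interactions are undirected.
        for protein, partner in ((left, right), (right, left)):
            if PROTEIN_TO_GENE.get(protein) == gene:
                results.append((protein, partner, PROTEIN_TO_GENE.get(partner)))
    return sorted(results)


for row in interactions_for_gene("GENE_A"):
    print(row)
```

Querying by `GENE_A` returns the interactions of both the full-length protein and the fragment, even though the two remain distinct protein entries.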

QUERY SPECIFICATION

MiMIr1 provided a wide variety of query interfaces, including a form-based interface and a visual query builder. However, we found that most users had a strong preference for an unfielded ‘Google-style’ search box into which they could enter query terms of their choice and then sift through returned results. (Users also liked point-and-click query specification; see the next section, on Cytoscape.) As such, MiMIr2 has simplified its query interface to provide users with a single query box. The only field requiring explicit specification is the organism of interest, if there is one. Users can enter anything at all, including words used in a textual description field, such as associations with a disease, biological process or molecular function. To support such queries better, we associate GO (8) labels with MiMIr2 entries. This resulted in a very useful query facility. However, there frequently are a number of genes associated with a biological term. Our system would order these based on information retrieval metrics, which may not match a scientist's view of relative importance. To help address this, we developed a tool, Gene2Mesh, which could be used to go from a MeSH (9) heading to a ranked list of genes mentioned frequently in papers to which that MeSH heading was associated. It could also go from a gene to a ranked list of MeSH headings associated with the gene. We integrated Gene2Mesh with MiMI so that a user can refine their topical query with MeSH terms if an initial attempt through an open text box yields unsatisfactory results. Gene2Mesh uses occurrence frequency to estimate importance of connection. However, not all occurrences are equally important. If we wish to find genes that are central to a particular disease or process, we should expect them to be centrally located in a graph of genes related to that disease or process. This notion of graph centrality is exploited to rank genes in a new tool, GIN (10,11), part of CLAIRLIB (12). 
MiMIr2 also integrates GIN, so that users can choose genes of importance related to a specific biological concept. Note that Gene2Mesh and GIN provide a means for users to learn about new genes related to a concept of interest. It is natural, when specifying queries, to consider one gene at a time. Clicking through from query results likewise leads to one specific gene at a time. Yet, there is a class of scientists who are interested in sets of genes, for example, a set that is differentially over-expressed in some microarray experiment. In MiMIr2, we added a set-of-genes functionality to support such users. Users can either type in a list of gene symbols or gene IDs, or import a file containing these, and use that to query MiMIr2 through a special interface set up explicitly for this purpose. Once they are past this step, everything else works just as for single gene specifiers.
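
The difference between frequency-based and centrality-based ranking can be sketched as follows. This is an illustration under stated assumptions, not the Gene2Mesh or GIN code: the paper records, the edge list and both function names are hypothetical, and plain degree centrality is used as a stand-in because the text above does not say which centrality measure GIN applies.

```python
from collections import Counter

# Hypothetical data: each paper lists its MeSH headings and the genes it mentions.
PAPERS = [
    {"mesh": {"Disease X"}, "genes": ["G1", "G2"]},
    {"mesh": {"Disease X"}, "genes": ["G1", "G3"]},
    {"mesh": {"Disease X"}, "genes": ["G1"]},
    {"mesh": {"Disease Y"}, "genes": ["G2", "G3"]},
]

# Hypothetical undirected interaction edges among genes related to "Disease X".
EDGES = [("G2", "G1"), ("G2", "G3"), ("G2", "G4")]


def rank_by_frequency(papers, heading):
    """Rank genes by the number of papers tagged with `heading` that mention them."""
    counts = Counter()
    for paper in papers:
        if heading in paper["mesh"]:
            counts.update(set(paper["genes"]))  # count each gene once per paper
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def rank_by_degree_centrality(edges):
    """Rank genes by degree centrality: neighbours / (number of nodes - 1)."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n < 2:
        return []
    scores = {gene: len(ns) / (n - 1) for gene, ns in neighbours.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(rank_by_frequency(PAPERS, "Disease X"))
print(rank_by_degree_centrality(EDGES))
```

In this toy data, G1 is the most frequently mentioned gene for the heading, while G2 is the most central gene in the interaction graph, which is the kind of disagreement that motivates offering both rankings.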

VIEWING INFORMATION

We observed scientists using MiMIr1, and also conducted focus groups, to identify result presentation features that users liked and disliked. Based on this feedback, we created a very sparse, but maximally informative, user interface, with every screen laid out carefully, and designed to work whether the number of returned results is 2 or 2000. By making the front page simple and minimalist in MiMIr2, we invite the user to type anything and get started with viewing useful results. Figure 1 shows the results of a query for the gene ABC1. The user is presented upfront with information such as organism, aliases, description, GO cellular component, GO molecular function and GO biological process, in order to facilitate identification and confirmation by the user. If the user determines this to be the gene of interest, quick links lead to further gene and protein information, interaction information, pathways and documentation.

Figure 1.

The new results page in MiMI r2.

Sometimes a user may just wish to browse the database. Once a gene of interest is located, for example, a scientist may want to look around and see what else is there ‘in the neighborhood’. To support such use, we developed a browsing interface. The user can look through sets of genes that have the same values for one or more of several attributes. Additionally, many scientists find a visual interface useful to see a graph of interactions. Cytoscape (13) is a popular tool widely used for this purpose. MiMIr1 permitted users to export lists of interactions in SIF format, which could be read and viewed in a Cytoscape browser. However, there were several limitations: output storage, Cytoscape initialization, format limitations, inability to query-on-the-fly with MiMI, etc. All of these issues have been addressed in MiMIr2 through the creation of a MiMI plugin for Cytoscape (14) and through the development of a single-click web start version of Cytoscape that can be invoked from the MiMI web site. Most pages in the MiMI web site now have a clickable link to Cytoscape that will lead the user to a Cytoscape browser session. Users starting from Cytoscape now have access to MiMI data, including all gene and interaction attributes, so that they can query based on these while remaining in the Cytoscape environment. The MiMI plugin for Cytoscape has already been downloaded 2318 times since its release earlier this year. Cytoscape is a large software package with many features. While we know many scientists who absolutely love it, we found others who have never used it and found the learning curve too steep to invest the effort required to become effective users. To address this section of our user base, we developed a stripped down ‘NetBrowser’ in Adobe Flash, with extremely limited functionality, but with a very simple interface that anyone can use without training. 
From most MiMIr2 pages, the user can get to NetBrowser and see a set of interactions visually rather than in tabular form. Pathway databases have traditionally had their own visual representations of pathways, and we have preserved these as we integrated pathway data in MiMIr2. Moreover, we have provided a unique pathway search capability through SAGA (15,16). Once the user has identified a set of interacting genes of interest in NetBrowser, this interacting set can be used to query for pathways in which similar interaction patterns of these genes can be found (by clicking the button, ‘Export to SAGA’ from within NetBrowser).
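
For readers unfamiliar with the SIF export mentioned above, the sketch below writes a list of interactions in Cytoscape's simple interaction format, where each line holds a source node, an interaction type and a target node. The function name `to_sif` and the gene symbols are hypothetical; `pp` is the conventional SIF label for a protein-protein interaction, and this is not the exporter MiMIr1 shipped.

```python
def to_sif(interactions, relation="pp"):
    """Render (node_a, node_b) pairs as tab-delimited SIF lines."""
    # One interaction per line: <source node> <interaction type> <target node>.
    return "".join(
        f"{node_a}\t{relation}\t{node_b}\n" for node_a, node_b in interactions
    )


# Hypothetical gene symbols used purely as node labels.
sif_text = to_sif([("GENE_A", "GENE_B"), ("GENE_B", "GENE_C")])
print(sif_text, end="")
```

A file with this content can be imported into Cytoscape as a network. The limitations listed above still apply to such a file: it carries no gene or interaction attributes and cannot be refreshed by querying MiMI.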

DATA VERIFICATION

Scientists typically want to understand where data came from before they are willing to trust it. Being cognizant of this, in MiMIr1, we carefully preserved provenance information regarding the source of every piece of data, and we continue to do this in MiMIr2. Many interaction databases provide users with PubMed IDs of publications from which a fact was derived (and this information was preserved in MiMIr1). The user is then free to obtain the original publication and peruse it to determine the reliability of the reported interaction. MiMIr2 provides a unique capability, grounded in natural language processing, for the user to look up, while in MiMIr2 itself, the specific sentences in cited publications from which the interaction information was derived, with extracted terms of importance highlighted. (After search, click on the Gene name to see gene detail; scroll to Related Documents and click on View.) Figure 2 shows a sample of the available information. The user is thus saved the effort of locating and reading the entirety of the cited publication; instead, their attention is immediately drawn to the few sentences that are likely to matter the most.

Figure 2.

Natural language processing identification of sentences within a document dealing with information of interest.
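
The sentence-level lookup can be approximated with a simple co-occurrence filter. This is only a rough stand-in for the natural language processing used in MiMIr2, whose method is not detailed here: the function `highlight_sentences`, the `**` markers and the sample text are all assumptions made for illustration.

```python
import re


def highlight_sentences(text, terms):
    """Return sentences mentioning at least two of `terms`, with terms wrapped in **."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        found = {match.group(1).lower() for match in pattern.finditer(sentence)}
        if len(found) >= 2:  # keep only sentences where two distinct terms co-occur
            hits.append(pattern.sub(lambda m: f"**{m.group(1)}**", sentence))
    return hits


# Hypothetical abstract text and gene names.
ABSTRACT = (
    "GeneA was cloned in 1990. "
    "We show that GeneA binds GeneB in vitro. "
    "GeneB is a kinase."
)
print(highlight_sentences(ABSTRACT, ["GeneA", "GeneB"]))
```

Only the middle sentence mentions both genes, so it is the single sentence returned, with both names marked.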

DATA SETS

MiMI currently has over 3.7 million interactions, along with information about approximately 3.5 million genes, 19.2 million molecules and 1288 pathways. Table 1 contains the list of sources in MiMI and their contributions. Additionally, supplementary protein information was integrated from: GO (8), InterPro (26), IPI (27), miBLAST (28), OrganelleDB (29), OrthoMCL (30), Pfam (31) and ProtoNet (32).

Table 1.

Interaction and pathway datasets in MiMIr2

Source	Number of molecules	Number of interactions
BIND (17)	111 112	233 201
Center for Cancer Systems Biology, Harvard (18)	3134	6683
BioGRID (19)	48 855	167 330
DIP (20)	76 771	49 677
HPRD (21)	73 209	116 333
IntAct (22)	197 141	77 780
Max Delbrück Center (23)	1909	3269
MINT (24)	195 119	125 672
Reactome (7)	48 808	2 375 780
WSU Campylobacter Jejuni Interactome (25)	1332	12 012


FUNDING

National Institutes of Health (R01 LM008106 and U54 DA021519, partial). Funding for open access charge: NIH grant U54 DA021519. Conflict of interest statement. None declared.

29 in total

1. Creating the gene ontology resource: design and implementation.

Authors:
Journal: Genome Res Date: 2001-08 Impact factor: 9.043

2. IntAct: an open source molecular interaction database.

Authors: Henning Hermjakob; Luisa Montecchi-Palazzi; Chris Lewington; Sugath Mudali; Samuel Kerrien; Sandra Orchard; Martin Vingron; Bernd Roechert; Peter Roepstorff; Alfonso Valencia; Hanah Margalit; John Armstrong; Amos Bairoch; Gianni Cesareni; David Sherman; Rolf Apweiler
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

3. ProtoNet: hierarchical classification of the protein space.

Authors: Ori Sasson; Avishay Vaaknin; Hillel Fleischer; Elon Portugaly; Yonatan Bilu; Nathan Linial; Michal Linial
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

4. BIND: the Biomolecular Interaction Network Database.

Authors: Gary D Bader; Doron Betel; Christopher W V Hogue
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

5. Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal: Genome Res Date: 2003-11 Impact factor: 9.043

6. MEDICAL SUBJECT HEADINGS IN MEDLARS.

Authors: W SEWELL
Journal: Bull Med Libr Assoc Date: 1964-01

7. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions.

Authors: Ioannis Xenarios; Lukasz Salwínski; Xiaoqun Joyce Duan; Patrick Higney; Sul-Min Kim; David Eisenberg
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

8. Identifying gene-disease associations using centrality on a literature mined gene-interaction network.

Authors: Arzucan Ozgür; Thuy Vu; Günes Erkan; Dragomir R Radev
Journal: Bioinformatics Date: 2008-07-01 Impact factor: 6.937

9. Development of human protein reference database as an initial platform for approaching systems biology in humans.

Authors: Suraj Peri; J Daniel Navarro; Ramars Amanchy; Troels Z Kristiansen; Chandra Kiran Jonnalagadda; Vineeth Surendranath; Vidya Niranjan; Babylakshmi Muthusamy; T K B Gandhi; Mads Gronborg; Nieves Ibarrola; Nandan Deshpande; K Shanker; H N Shivashankar; B P Rashmi; M A Ramya; Zhixing Zhao; K N Chandrika; N Padma; H C Harsha; A J Yatish; M P Kavitha; Minal Menezes; Dipanwita Roy Choudhury; Shubha Suresh; Neelanjana Ghosh; R Saravana; Sreenath Chandran; Subhalakshmi Krishna; Mary Joy; Sanjeev K Anand; V Madavan; Ansamma Joseph; Guang W Wong; William P Schiemann; Stefan N Constantinescu; Lily Huang; Roya Khosravi-Far; Hanno Steen; Muneesh Tewari; Saghi Ghaffari; Gerard C Blobe; Chi V Dang; Joe G N Garcia; Jonathan Pevsner; Ole N Jensen; Peter Roepstorff; Krishna S Deshpande; Arul M Chinnaiyan; Ada Hamosh; Aravinda Chakravarti; Akhilesh Pandey
Journal: Genome Res Date: 2003-10 Impact factor: 9.043

10. Introducing meta-services for biomedical information extraction.

Authors: Florian Leitner; Martin Krallinger; Carlos Rodriguez-Penagos; Jörg Hakenberg; Conrad Plake; Cheng-Ju Kuo; Chun-Nan Hsu; Richard Tzong-Han Tsai; Hsi-Chuan Hung; William W Lau; Calvin A Johnson; Rune Saetre; Kazuhiro Yoshida; Yan Hua Chen; Sun Kim; Soo-Yong Shin; Byoung-Tak Zhang; William A Baumgartner; Lawrence Hunter; Barry Haddow; Michael Matthews; Xinglong Wang; Patrick Ruch; Frédéric Ehrler; Arzucan Ozgür; Güneş Erkan; Dragomir R Radev; Michael Krauthammer; ThaiBinh Luong; Robert Hoffmann; Chris Sander; Alfonso Valencia
Journal: Genome Biol Date: 2008-09-01 Impact factor: 13.583
