DASMIweb: online integration, analysis and assessment of distributed protein interaction data.

Hagen Blankenburg, Fidel Ramírez, Joachim Büch, Mario Albrecht.

Abstract

In recent years, we have witnessed a substantial increase in the amount of available protein interaction data. However, most data are currently not readily accessible to the biologist at a single site, but are scattered over multiple online repositories. Therefore, we have developed the DASMIweb server, which supports the integration, analysis and qualitative assessment of distributed sources of interaction data in a dynamic fashion. Since DASMIweb allows for querying many different resources of protein and domain interactions simultaneously, it serves as an important starting point for interactome studies and assists the user in finding publicly accessible interaction data with minimal effort. The pool of queried resources is fully configurable and supports the inclusion of the user's own interaction data or confidence scores. In particular, DASMIweb integrates confidence measures like functional similarity scores to assess individual interactions. The retrieved results can be exported in different file formats like MITAB or SIF. DASMIweb is freely available at http://www.dasmiweb.de.

Year: 2009 PMID: 19502495 PMCID: PMC2703953 DOI: 10.1093/nar/gkp438

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Protein interactions play an important role in many cellular processes (1). Different small- and large-scale experimental techniques together with the manual curation of the scientific literature as well as numerous computational prediction methods generate ever increasing amounts of publicly accessible protein interaction data (2). However, this rapid accumulation of data makes it difficult for researchers to keep track of all available information because it is scattered over multiple online repositories. As of April 2009, the pathway resource list Pathguide (3) lists 118 databases providing protein interaction data. Some of these projects are highly specialized and focus, for example, on interactions of molecular subcomponents or specific classes of proteins, on specific diseases or organisms, or on experimentally observed or computationally predicted interactions. Moreover, doubts have been raised about the quality and reliability of protein interaction data and particular detection methods (2,4,5). Databases that collect and curate experimentally observed protein–protein interactions reported in the literature (6–13) are essential pillars of interactomics, but they cover only a small fraction of the complete set of interactions, and thus proteome-wide predictions are also required (2,4). All these efforts have resulted in a multitude of resources that the user has to query individually. Initiatives like IMEx (14) that promote data exchange between some of the databases are very important, but are still in an early implementation phase. One possible solution for integrating protein interaction data is the creation of data warehouses as composite databases that centrally store and merge the available data from multiple sources (10,11,15–23).
However, the static data unification procedure underlying data warehouses has the considerable drawback of providing only a snapshot of a fixed number of data sources at a certain point of time. Once the data have been included into the central repository, curation efforts are required to keep it up to date and in sync with the original data sources. Furthermore, data warehouses are rather inflexible as the inclusion of additional datasets, for example, new experimental or predicted data or improved confidence scores, can normally be accomplished solely by the central authority and not by the user. In the context of the European BioSapiens network (24), we have developed DASMIweb as a gateway to interactome data from multiple resources. In contrast to composite databases, data are not stored in a local repository, but queries are distributed to the original data sources and the unified results are displayed (25). Due to this novel realization as a distributed and dynamic system, DASMIweb bypasses the inherent rigidity of static databases and addresses their problem of data update cycles. In addition, DASMIweb allows access to distributed servers with confidence scores, which can be used to evaluate the quality of individual interactions with different scoring methods.

MATERIALS AND METHODS

Distributed architecture

The fundamental concept of DASMIweb is decentralization (Figure 1). Here, the interaction data remain distributed with their original providers instead of being periodically aggregated into central data repositories (10,11,15–23). Subsequent to a user request, DASMIweb independently queries each original data provider for interactions, additional annotations, and interaction confidence scores. Then it unifies the retrieved results and presents them to the user.
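The query-then-unify flow described above can be illustrated with a minimal Python sketch. It is not the DASMIweb implementation: the three source functions, the identifiers `QUERY`, `PARTNER_1` and `PARTNER_2`, and the function name `query_all` are hypothetical placeholders, and a real client would issue DAS or web service requests instead of returning fixed lists.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for remote data sources (made-up identifiers).
def source_a(query):
    return [(query, "PARTNER_1")]

def source_b(query):
    return [(query, "PARTNER_1"), (query, "PARTNER_2")]

def failing_source(query):
    raise TimeoutError("source did not respond")

def query_all(sources, query):
    """Query every source independently and merge the results.

    Returns a dict mapping each interaction (an unordered pair of
    partners) to the set of source names that reported it.
    """
    merged = {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {pool.submit(fn, query): name for name, fn in sources.items()}
        # Results are processed in the order in which sources respond.
        for future in as_completed(futures):
            name = futures[future]
            try:
                pairs = future.result()
            except Exception:
                continue  # an unavailable source must not block the others
            for a, b in pairs:
                merged.setdefault(frozenset((a, b)), set()).add(name)
    return merged
```

Because each source is contacted independently, a slow or failing provider only removes its own column from the result instead of delaying the whole query.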

Figure 1.

Decentralized architecture of DASMIweb. Data sources for protein and domain interactions as well as for interaction confidence scores are distributed over the Internet and are contacted by DASMIweb upon user request.

The technical architecture of DASMIweb, based on an extension of the Distributed Annotation System (DAS) (25,26) and different types of web services (27,28), has the great advantage of being easily extendable with new data sources. In addition, data update cycles every few weeks or months are not necessary because all data are left in their source database and are only retrieved on request. This use of a distributed architecture greatly empowers end users, who can instantly add data sources, for instance, their own private interactions or the results of an improved confidence scoring method. DASMIweb also gives data providers the possibility to easily share their results without the time-consuming development of their own web interfaces. Since the distributed architecture supported by DASMIweb is driven by the community that provides the contents, there is no need for central authorities to decide on the available resources. This is evident, for example, in the case of confidence scoring methods, which can be based on various criteria, for instance, co-expression, co-localization, functional co-annotation, network topology and evolutionary conservation. Clearly, it would be impractical to implement and maintain all these different methods at a single site. Instead, each method developed by some independent institution can be queried through DASMIweb. It is noteworthy that our decentralization approach has also attracted the interest of the Proteomics Standards Initiative (PSI) of the Human Proteome Organization (HUPO), which is currently defining standards for distributed interaction data retrieval and interaction confidence scoring (28). We are actively contributing to these projects and all servers developed in this context are accessible via DASMIweb (Table 1).

Table 1.

Interaction datasets currently available as data sources through DASMIweb and corresponding references

Dataset name	Interaction type	Determination method	Server type	Reference
CCSB-HI1	PPI	Large-scale experiment	DASMI	(37)
MDC	PPI	Large-scale experiment	DASMI	(38)
BioGRID	PPI	Literature curation	PSI WS	(10)
DIP	PPI	Literature curation	DASMI	(8)
HPRD	PPI	Literature curation	DASMI	(13)
IntAct	PPI	Literature curation	PSI WS	(12)
MINT	PPI	Literature curation	PSI WS	(6)
MPIDB	PPI	Literature curation	PSI WS	(11)
Bioverse	PPI	Prediction	DASMI	(39)
HiMAP	PPI	Prediction	DASMI	(40)
HiMAP core	PPI	Prediction	DASMI	(40)
HomoMINT	PPI	Prediction	DASMI	(41)
OPHID	PPI	Prediction	DASMI	(42)
POINT	PPI	Prediction	DASMI	(43)
Sanger	PPI	Prediction	DASMI	(44)
Sanger core	PPI	Prediction	DASMI	(44)
3did	DDI	3D structure analysis	DASMI	(45)
iPfam	DDI	3D structure analysis	DASMI	(46)
PiNS	DDI	3D structure analysis	DASMI	(47)
APMM1	DDI	Prediction	DASMI	(48)
APMM2	DDI	Prediction	DASMI	(48)
DIMA 2.0 dprof	DDI	Prediction	DASMI	(18)
DIMA 2.0 dpea	DDI	Prediction	DASMI	(18)
DIMA 2.0 string	DDI	Prediction	DASMI	(18)
DPEA	DDI	Prediction	DASMI	(49)
InterDom	DDI	Prediction	DASMI	(50)
IPPRI	DDI	Prediction	DASMI	(51)
IPPRI core	DDI	Prediction	DASMI	(51)
LDSC	DDI	Prediction	DASMI	(52)
LDSC core	DDI	Prediction	DASMI	(52)
LLZ	DDI	Prediction	DASMI	(53)
LP	DDI	Prediction	DASMI	(54)
RCDP50	DDI	Prediction	DASMI	(55)
RDFF	DDI	Prediction	DASMI	(56)
TW	DDI	Prediction	DASMI	(57)
FunSimMat	PPI-CM	Gene Ontology-based	DASMI	(58,59)
Domain support	PPI-CM	DDI-based	DASMI	(2)

The interaction type PPI indicates protein–protein interactions, DDI domain–domain interactions, and PPI-CM confidence measures that can be used to assess the quality of protein–protein interactions. The server type DASMI indicates data sources that are available by an extension of the DAS protocol, and PSI WS denotes web services following the standard currently being developed by HUPO-PSI.

Data sources

DASMIweb has been developed to support different levels of molecular interactions, for example, interactions of proteins as well as of protein domains. In the following, we will refer to the distributed data servers that provide interactions, additional annotations or interaction confidence scores as data sources (in contrast to our server DASMIweb). As of April 2009, DASMIweb provides access to 35 data sources containing experimentally determined and computationally derived protein and domain interaction datasets (Table 1). In addition, there are two data sources for scoring the confidence of protein–protein interactions. In our current setup, the majority of data sources are temporarily cached and maintained at our institute, even though this appears to contradict the actual DASMIweb aim of leaving the interaction data with their original providers. At present, this setup is unavoidable for demonstrating the capabilities of our system; otherwise, many more external data source providers would have been needed from the beginning. Nevertheless, several major protein interaction databases like BioGRID (10), IntAct (12) and MINT (6) are not cached as they already support external web service access to their data. Moreover, since most of the cached datasets are the results of single studies, they are not updated by their authors and do not require any maintenance efforts. The other data sources are updated by us on a monthly basis. Of course, we will replace a temporarily cached source as soon as the respective original provider supports web service access to its data. The current selection of data sources listed in Table 1 is only a snapshot; additional resources for interactions and confidence scores are currently being prepared at other institutions for public access in the near future. At the moment, there are two alternative ways of providing new data sources. The first option is the download of a server library from our website http://www.dasmi.de.
This software, available as Java and Perl implementations, parses interaction data from several standard file formats and serves them in a format supported by DASMIweb (25). An online tutorial on setting up one's own data sources is available on our website. Data sources that provide confidence scores are handled like sources that contain interaction data and can be set up using the same software library. The second option is the implementation of a web service that follows the standard currently defined by HUPO-PSI (28). However, as this standard is not yet published, it might still evolve.

Identifier mapping

Proteomics research uses a substantial diversity of object identifiers for describing genes, proteins or protein domains. Accordingly, interaction datasets use a variety of identifier systems for their data (2). In order to unify them, DASMIweb maintains internal mapping tables derived from iProClass (29) and Pfam (30) to convert identifiers between the different systems. For protein interactions, DASMIweb currently supports the identifier systems Ensembl (31), Entrez Gene (32), Entrez Geneinfo (32), RefSeq (32) and UniProtKB (33). In the following, we will refer to these identifier systems as compatible systems because mappings exist between them. The mappings enable DASMIweb to merge results from data sources that employ different but compatible identifier systems. For example, if the user requests all huntingtin interactions known for the UniProtKB protein ‘HD_HUMAN’, DASMIweb will automatically convert this identifier to the Entrez Gene identifier ‘3064’, the RefSeq identifier ‘NP_002102.4’, and all other compatible identifier systems described above. Subsequently, data sources providing interactions in the Entrez Gene or RefSeq identifier system will be queried in addition to data sources that use the requested UniProtKB identifier system. The final unification of interactions from different data sources is performed by converting all identifiers to Entrez Gene identifiers. Therefore, the DASMIweb results are independent of the particular identifier system used for querying; in the prior example, a query for ‘HD_HUMAN’ will return the same results as the one for ‘NP_002102.4’ or ‘3064’. It should be noted that the identifier mapping procedure can result in considerable, but unavoidable, computational overhead. While there is usually a one-to-one mapping from UniProtKB to Entrez Gene identifiers, mapping in the opposite direction may produce multiple results as one gene can be responsible for several protein variants or fragments.
Therefore, for example, when the user queries DASMIweb with the Entrez Gene identifier ‘3064’, it will be converted to the two UniProtKB entries ‘Q59FF4’ and ‘P42858’. Consequently, all interactions reported for the two protein variants or fragments will be included. Fortunately, the identifier diversity for domain interaction datasets is less problematic as stable Pfam identifiers (30) are predominantly used.
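The mapping step can be sketched in Python as follows. This is only an illustration under stated assumptions: the toy table contains just the huntingtin identifiers quoted above, whereas DASMIweb derives its real tables from iProClass and Pfam, and the names `TO_ENTREZ`, `to_entrez` and `expand` as well as the system labels are hypothetical.

```python
# Toy mapping table holding only the identifiers quoted in the text.
# Keys are (identifier system, identifier); values are Entrez Gene IDs.
TO_ENTREZ = {
    ("uniprotkb", "HD_HUMAN"): "3064",
    ("uniprotkb", "P42858"): "3064",
    ("uniprotkb", "Q59FF4"): "3064",
    ("refseq", "NP_002102.4"): "3064",
    ("entrezgene", "3064"): "3064",
}

def to_entrez(system, identifier):
    """Map any supported identifier to its Entrez Gene ID (or None)."""
    return TO_ENTREZ.get((system, identifier))

def expand(system, identifier):
    """Return all compatible identifiers for a query, grouped by system.

    The query is first reduced to its Entrez Gene ID; every identifier
    that maps to the same gene is then collected, so one gene may expand
    to several protein identifiers.
    """
    gene = to_entrez(system, identifier)
    result = {}
    if gene is None:
        return result
    for (other_system, other_id), other_gene in TO_ENTREZ.items():
        if other_gene == gene:
            result.setdefault(other_system, set()).add(other_id)
    return result
```

In this sketch, `expand("uniprotkb", "HD_HUMAN")` and `expand("entrezgene", "3064")` yield the same identifier sets, which mirrors the statement that results do not depend on the identifier system used for querying.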

USER INTERFACE

Our primary goal while designing DASMIweb was user-friendliness, which we tried to achieve by an intuitive user interface and a clear representation of the results. In addition, several tutorials on our website guide the user through potential query and analysis tasks. The most important parts of the user interface are the Query, Information, and Interaction Panels (Figure 2). DASMIweb requires a JavaScript-enabled browser for a technology known as Asynchronous JavaScript and XML (AJAX). The AJAX functionality, provided by the Direct Web Remoting framework (http://getahead.org/dwr/), compensates for different data source response times and allows for presenting interaction results to the user as soon as DASMIweb receives them. DASMIweb stores all data associated with a user request in sessions. This means that all interactions, interaction details, confidence scores and all DASMIweb configurations are maintained for half an hour even if the user temporarily leaves our website.

Figure 2.

DASMIweb user interface. The screen is separated into the top left Query Panel, the top right Information Panel and the central Interaction Panel. Interactions are presented in tabular form: each column represents a data source, each row contains interaction partner(s), and each square at the intersection of a row and a column indicates a particular interaction. The Gene Ontology-based confidence measure FunSimMat-BPscore is selected, and the interaction squares are colored with a white-to-blue gradient: white for no functional similarity and dark blue for complete similarity.

Querying

The Query Panel in the top left corner of the screen contains only a single search field to allow the user straightforward querying. The user does not need to specify the input type; DASMIweb tries to determine it automatically. If the identifier type cannot be resolved unambiguously, the user is asked to refine the query. As detailed above, DASMIweb will not only include all data sources that use the same identifier system as the query, but also attempt to map the query identifier to all compatible identifier systems to include additional sources. DASMIweb converts only between protein (e.g. UniProtKB or RefSeq) and gene (e.g. Entrez Gene) identifier systems, which does not include domain identifier systems like Pfam. All data sources for which a suitable identifier is found will subsequently be queried for interactions.
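How the input type might be guessed from a single search field can be sketched with a few regular expressions. The text does not describe DASMIweb's actual detection rules, so the patterns below are simplified assumptions covering only some identifier shapes, and the names `PATTERNS`, `candidate_systems` and `resolve` are hypothetical.

```python
import re

# Simplified, illustrative patterns (assumptions, not DASMIweb's rules).
PATTERNS = [
    # UniProtKB accession, e.g. P42858
    ("uniprotkb", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    # UniProtKB entry name, e.g. HD_HUMAN
    ("uniprotkb", re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")),
    # RefSeq protein identifier with optional version, e.g. NP_002102.4
    ("refseq", re.compile(r"[A-Z]P_[0-9]+(?:\.[0-9]+)?")),
    # Purely numeric identifiers are treated as Entrez Gene IDs here
    ("entrezgene", re.compile(r"[0-9]+")),
]

def candidate_systems(query):
    """Return every identifier system whose pattern matches the whole query."""
    found = []
    for system, pattern in PATTERNS:
        if pattern.fullmatch(query) and system not in found:
            found.append(system)
    return found

def resolve(query):
    """Return the single matching system, or None if unknown or ambiguous."""
    candidates = candidate_systems(query.strip())
    if len(candidates) == 1:
        return candidates[0]
    return None  # caller would ask the user to refine the query
```

Returning `None` for both unknown and ambiguous input keeps the sketch short; an interface like the one described would distinguish the two cases when prompting the user.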

Result presentation

Information on the query interactor, such as names, synonyms, or external database references, is provided in the Information Panel located in the top right corner of the screen. Interaction results are presented to the user in a table within the central Interaction Panel: table columns represent data sources that have been queried for interactions, rows contain interaction partners (single partners for binary interactions and all partners for protein complexes), and squares in the intersections of rows and columns indicate particular interactions (Figure 2). Different background colors for each data source in the table header highlight the corresponding interaction determination methods (Table 1): green represents sources with data derived from experimental studies or curation of the scientific literature, yellow represents computational predictions. The interaction table is built gradually, and new rows and interaction squares are inserted as soon as results have been retrieved from the data sources. In addition, the interactions can be sorted according to individual table columns or by their frequency of occurrence in all data sources. For the sake of clarity, a tabbed display only shows a user-definable number of interactions per page (initially set to 50); arrows allow for browsing through additional results. Display options like sorting and tabbed browsing can be configured in the myDASMI Panel, which can be opened by clicking on the corresponding box in the middle of the right screen border. Our tabular representation supports a quick visual assessment of the results, based on the assumption that interactions that are reported in several datasets are more likely to be accurate. To further investigate particular interactions, the user can click on an interaction square and request the display of a new table row with additional information on an interaction and the interaction partner(s).
For example, the additional information may include links to the original publication that reported the interaction, information on the experimental settings or conditions, a web link to the full entry in the source database or external database identifiers for the interaction partner(s). The amount of detail given in the additional information is primarily defined by the data source provider that reports the respective interaction and not by DASMIweb.
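The tabular layout, the sorting by frequency of occurrence and the paging can be sketched in a few lines of Python. The function name `build_table`, its parameters and the input structure (a dict from interaction partner to the set of reporting sources) are assumptions made for illustration; only the default page size of 50 is taken from the text.

```python
def build_table(hits, sources, page_size=50, page=0):
    """Build one page of the interaction table.

    hits:    dict mapping an interaction partner to the set of source
             names that report the interaction.
    sources: ordered list of source names (the table columns).

    Each returned row is [partner, flag_1, ..., flag_n], where flag_i is
    True if source i reports the interaction (a filled square).
    """
    # Most frequently reported partners first; ties broken alphabetically.
    ranked = sorted(hits, key=lambda partner: (-len(hits[partner]), partner))
    start = page * page_size
    rows = []
    for partner in ranked[start:start + page_size]:
        rows.append([partner] + [src in hits[partner] for src in sources])
    return rows
```

Sorting by the number of reporting sources puts interactions that several datasets agree on at the top, in line with the assumption stated above that such interactions are more likely to be accurate.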

Data source configuration

DASMIweb maintains a list of all publicly available interaction sources (Table 1) in the Source Configuration Panel, which can be opened by pressing the corresponding button in the Query Panel. The sources are grouped according to their identifier system, and basic information such as the name, source type and a description is provided for each entry. As described above, green and yellow background colors indicate different interaction determination methods. A blue background represents data sources useful for interaction confidence scoring. Initially, all data sources are active and will be used for answering queries. The user can deactivate data sources by removing the checkmark in front of the respective entry. We currently support three approaches for including new data sources in DASMIweb. First, all data sources registered at the central DAS registry (http://www.dasregistry.org) will be available automatically to the DASMIweb users. The second option is the local registration of a data source by providing information like its name, URL, and identifier system. The third option is uploading a PSI-MI XML2.5 file (34) to DASMIweb and temporarily creating a data source from its content. The first two options require that the data source to be added is already set up and accessible over the Internet. In contrast, the third option allows users to compare their own interactions with existing datasets or to assess them with different confidence scoring servers. Another distinction can be made with respect to data privacy: data sources added with the first approach are accessible to all DASMIweb users, while the second and third approaches affect only the respective user session.

Interaction confidence scoring

Despite improved methods for generating protein interaction data (35), current interactomes are still incomplete to a large extent and doubts about the reliability of detection methods remain (2,4,5). Quality assessment is crucial not only for interactions determined by large-scale experimental assays, but also for those curated from scientific literature or obtained by computational prediction methods. Therefore, DASMIweb provides access to specialized data sources (Table 1) and, at the same time, supports the convenient evaluation of the quality of individual protein interactions. As the distributed retrieval of confidence scores for a large number of interactions can be computationally demanding, it has to be explicitly requested by pressing a button in the header of the interaction table. After retrieval, the different scoring methods can be selected in a drop-down menu next to the same button. This menu also lists all original confidence scores provided by the authors of a source dataset. A brief description of the selected scoring method can be found in the bottom right corner of the screen. Confidence scores are printed atop the interaction squares and are also available as new interaction details. If the scores of a method can be normalized to a range between zero and one, the respective interaction squares are additionally colored with a white-to-blue gradient: white for the value zero and dark blue for the value one (Figure 2).
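The white-to-blue coloring of normalized scores amounts to a linear interpolation between two colors, which can be sketched as follows. The exact shade used for dark blue is not given in the text, so the RGB value (0, 0, 139) and the function name `score_to_rgb` are assumptions.

```python
def score_to_rgb(score, dark_blue=(0, 0, 139)):
    """Map a normalized confidence score in [0, 1] to an RGB color.

    0 gives white, 1 gives the chosen dark blue, and values in between
    are linearly interpolated per channel. Scores outside [0, 1] are
    clamped to the nearest bound.
    """
    s = min(max(score, 0.0), 1.0)
    white = (255, 255, 255)
    return tuple(round(w + (d - w) * s) for w, d in zip(white, dark_blue))
```

Scores from methods that cannot be normalized to this range would simply not be passed to such a function, matching the behavior described above where those scores are only printed.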

Exporting results

Interaction results can be exported in different file formats, enabling the user to analyze the retrieved data further in other applications. Currently, we support the Simple Interaction Format (SIF), defined by the network analysis and visualization program Cytoscape (36), and the tabular MITAB2.5 format as specified by HUPO-PSI (34).
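As an illustration of the simpler of the two formats, the following Python sketch writes binary interactions as SIF lines of the form "source node, relationship type, target node". The function name `to_sif` and the input structure are hypothetical; the relationship label `pp`, which Cytoscape documents for protein–protein interactions, is used as the default, and the identifiers in the test data are placeholders.

```python
def to_sif(pairs, relation="pp"):
    """Serialize binary interactions as tab-delimited SIF text.

    pairs: iterable of (interactor_a, interactor_b) tuples.
    Duplicate pairs are written once, and lines are sorted so that the
    output is reproducible.
    """
    lines = [f"{a}\t{relation}\t{b}" for a, b in sorted(set(pairs))]
    return "".join(line + "\n" for line in lines)
```

MITAB2.5 is a richer tabular format with additional columns (for example detection method and publication), so exporting it would require the interaction details that SIF leaves out.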

CONCLUSIONS

We presented our new web server DASMIweb that supports the online integration, analysis and assessment of distributed sets of molecular interaction data in a dynamic and user-configurable fashion. DASMIweb provides access to over thirty different interaction and confidence scoring resources, which constitutes one of the largest amounts of protein and domain interaction data available through one web interface. In particular, DASMIweb can be used to assess the quality of arbitrary user-defined sets of protein interactions with different confidence scoring methods. Due to the decentralized architecture, users can easily extend DASMIweb by adding further data sources. Additional data sources for providing protein interactions and confidence scoring methods are already expected to be made available within DASMIweb by different external sites in the near future. Furthermore, additional DASMIweb features are currently under development, ranging from support for full-text searches and batch queries for multiple interactors to additional data import and export formats.

57 in total

1. Integration of biological networks and gene expression data using Cytoscape.

Authors: Melissa S Cline; Michael Smoot; Ethan Cerami; Allan Kuchinsky; Nerius Landys; Chris Workman; Rowan Christmas; Iliana Avila-Campilo; Michael Creech; Benjamin Gross; Kristina Hanspers; Ruth Isserlin; Ryan Kelley; Sarah Killcoyne; Samad Lotia; Steven Maere; John Morris; Keiichiro Ono; Vuk Pavlovic; Alexander R Pico; Aditya Vailaya; Peng-Liang Wang; Annette Adler; Bruce R Conklin; Leroy Hood; Martin Kuiper; Chris Sander; Ilya Schmulevich; Benno Schwikowski; Guy J Warner; Trey Ideker; Gary D Bader
Journal: Nat Protoc Date: 2007 Impact factor: 13.491

2. Pathguide: a pathway resource list.

Authors: Gary D Bader; Michael P Cary; Chris Sander
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

3. MPact: the MIPS protein interaction resource on yeast.

Authors: Ulrich Güldener; Martin Münsterkötter; Matthias Oesterheld; Philipp Pagel; Andreas Ruepp; Hans-Werner Mewes; Volker Stümpflen
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

4. APID: Agile Protein Interaction DataAnalyzer.

Authors: Carlos Prieto; Javier De Las Rivas
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

5. Topology and weights in a protein domain interaction network--a novel way to predict protein interactions.

Authors: Stefan Wuchty
Journal: BMC Genomics Date: 2006-05-23 Impact factor: 3.969

6. A new measure for functional similarity of gene products based on Gene Ontology.

Authors: Andreas Schlicker; Francisco S Domingues; Jörg Rahnenführer; Thomas Lengauer
Journal: BMC Bioinformatics Date: 2006-06-15 Impact factor: 3.169

7. DOMINE: a database of protein domain interactions.

Authors: Balaji Raghavachari; Asba Tasneem; Teresa M Przytycka; Raja Jothi
Journal: Nucleic Acids Res Date: 2007-10-02 Impact factor: 16.971

8. DIMA 2.0--predicted and known domain interactions.

Authors: Philipp Pagel; Matthias Oesterheld; Oksana Tovstukhina; Norman Strack; Volker Stümpflen; Dmitrij Frishman
Journal: Nucleic Acids Res Date: 2007-11-13 Impact factor: 16.971

9. Broadening the horizon--level 2.5 of the HUPO-PSI format for molecular interactions.

Authors: Samuel Kerrien; Sandra Orchard; Luisa Montecchi-Palazzi; Bruno Aranda; Antony F Quinn; Nisha Vinod; Gary D Bader; Ioannis Xenarios; Jérôme Wojcik; David Sherman; Mike Tyers; John J Salama; Susan Moore; Arnaud Ceol; Andrew Chatr-Aryamontri; Matthias Oesterheld; Volker Stümpflen; Lukasz Salwinski; Jason Nerothin; Ethan Cerami; Michael E Cusick; Marc Vidal; Michael Gilson; John Armstrong; Peter Woollard; Christopher Hogue; David Eisenberg; Gianni Cesareni; Rolf Apweiler; Henning Hermjakob
Journal: BMC Biol Date: 2007-10-09 Impact factor: 7.431