
STRING: known and predicted protein-protein associations, integrated and transferred across organisms.

Christian von Mering, Lars J Jensen, Berend Snel, Sean D Hooper, Markus Krupp, Mathilde Foglierini, Nelly Jouffre, Martijn A Huynen, Peer Bork.

Abstract

A full description of a protein's function requires knowledge of all partner proteins with which it specifically associates. From a functional perspective, 'association' can mean direct physical binding, but can also mean indirect interaction such as participation in the same metabolic pathway or cellular process. Currently, information about protein association is scattered over a wide variety of resources and model organisms. STRING aims to simplify access to this information by providing a comprehensive, yet quality-controlled collection of protein-protein associations for a large number of organisms. The associations are derived from high-throughput experimental data, from the mining of databases and literature, and from predictions based on genomic context analysis. STRING integrates and ranks these associations by benchmarking them against a common reference set, and presents evidence in a consistent and intuitive web interface. Importantly, the associations are extended beyond the organism in which they were originally described, by automatic transfer to orthologous protein pairs in other organisms, where applicable. STRING currently holds 730,000 proteins in 180 fully sequenced organisms, and is available at http://string.embl.de/.

Year: 2005 PMID: 15608232 PMCID: PMC539959 DOI: 10.1093/nar/gki005

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Several databases exist, whose main purpose is to collect and curate direct experimental evidence about protein–protein interactions (1–4). Other databases take a more generalized perspective on proteins and their associations, by functionally grouping proteins into metabolic, signaling or transcriptional pathways (5–8). Finally, a third class of resources attempts to fill gaps in both datasets, by predicting protein–protein associations de novo, using a variety of computational techniques (9–13). The database STRING (‘Search Tool for the Retrieval of Interacting Genes/Proteins’) represents an ongoing effort to provide these three types of protein–protein association evidence under one common framework. Such an integrated approach offers several unique advantages: (i) various types of evidence are mapped onto a single, stable set of proteins, thereby facilitating comparative analysis; (ii) known and predicted interactions often partially complement each other, leading to increased coverage; (iii) an integrated scoring scheme can provide higher confidence when independent evidence types agree; and (iv) mapping and transferring interactions onto a large number of organisms facilitates evolutionary studies. Because STRING is fully pre-computed, all information can be quickly accessed—both at the high-level network view and at the level of the individual interaction record. The various evidence types can be enabled or disabled separately, which allows the searches to be customized at run-time, and dedicated viewers allow the inspection of all the evidence underlying an association (Figure 1). The database is an exploratory resource: it contains a much larger number of associations than primary interaction databases—albeit with varying confidence scores. It is thus best used for getting a quick initial overview of the functional partners of a query protein, especially for proteins that are still poorly characterized.

Figure 1

Results from a STRING search. Inserts show partial screen shots from evidence pages, which are accessible from the main result page. Two proteins were used as inputs to the query—one is a subunit from the yeast ATP synthase complex, the other a subunit from the ubiquinol–cytochrome C reductase complex. The number of requested partners was limited to 10 (default settings). STRING reports both proteins to be members of functional modules, which are in turn connected as part of a larger unit. The diversity of evidence types supporting the modules is noted.

DATA SOURCES AND SCORING

Many of the protein–protein associations in STRING are imported from other databases (see below), but STRING also contains a large body of predicted associations that are produced de novo. These predictions are based on systematic genome comparisons [‘genomic context’, (14,15)]. We periodically import completely sequenced genomes [metazoan genomes from Ensembl, all others from SwissProt, (16)], and search them for three types of genomic context associations: conserved genomic neighborhood, gene fusion events, and co-occurrence of genes across genomes. All three searches aim to identify pairs of genes which appear to be under common selective pressures during evolution (more so than expected by chance), and which are therefore thought to be functionally associated. As for all other types of associations in STRING, we assign a confidence score to each predicted association. The scores are derived by benchmarking the performance of the predictions against a common reference set of trusted, true associations. We chose the functional grouping of proteins maintained at KEGG [Kyoto Encyclopedia of Genes and Genomes, (5)] as the reference. Any predicted association for which both proteins are assigned to the same ‘KEGG pathway’ is counted as a true positive. KEGG pathways are particularly suitable as a reference because they are based on manual curation, are available for a number of organisms, and cover several functional areas. The benchmarked confidence scores in STRING generally correspond to the probability of finding the linked proteins within the same KEGG pathway. STRING performs a similar benchmark for high-throughput experimental interaction data, separately for each dataset. Scores vary within one dataset because they include additional, intrinsic information from the data itself, such as the frequency or reciprocality of the detection (see Figure 2 for a typical benchmark). 
In contrast to high-throughput data, validated small-scale interactions, protein complexes, and annotated pathways are directly imported from databases (2,5,17), and given a uniform confidence score per dataset.
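The benchmarking idea described above can be illustrated with a minimal sketch. It assumes that raw prediction scores are grouped into bins and that the fraction of pairs sharing a pathway within each bin serves as the calibrated confidence. The function name, the binning by fixed thresholds, the skipping of unannotated proteins, and the toy identifiers are all illustrative assumptions, not the procedure actually used in STRING.

```python
from bisect import bisect_right


def benchmark_bins(predictions, pathways, edges):
    """Return, per raw-score bin, the fraction of pairs sharing a pathway.

    predictions: iterable of (protein_a, protein_b, raw_score)
    pathways:    dict mapping protein -> set of pathway identifiers
    edges:       ascending raw-score thresholds separating the bins
    """
    hits = [0] * (len(edges) + 1)
    totals = [0] * (len(edges) + 1)
    for a, b, raw in predictions:
        # Assumption: a pair can only be judged if both proteins are annotated.
        if a not in pathways or b not in pathways:
            continue
        i = bisect_right(edges, raw)
        totals[i] += 1
        if pathways[a] & pathways[b]:  # same pathway -> counted as true positive
            hits[i] += 1
    return [h / t if t else None for h, t in zip(hits, totals)]


# Toy data with hypothetical protein and pathway identifiers.
toy_pathways = {"A": {"path_1"}, "B": {"path_1", "path_2"}, "C": {"path_2"}}
toy_predictions = [
    ("A", "B", 0.2),  # share path_1
    ("A", "C", 0.2),  # no shared pathway
    ("B", "C", 0.9),  # share path_2
    ("A", "X", 0.9),  # X is unannotated and therefore skipped
]
calibrated = benchmark_bins(toy_predictions, toy_pathways, edges=[0.5])
print(calibrated)
```

Each returned value estimates the probability that a pair in that raw-score range falls within the same reference pathway, which is how the confidence scores are interpreted in the text.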

Figure 2

Deriving confidence scores for high-throughput interaction data [exemplified here for a dataset of protein complex purifications (22)]. In this case, the relative confidence depends on how often two proteins are pulled down together (a and b), versus how often they are pulled down alone (c and d). A purification is counted twice when one of the partners is the bait (a and d). The raw quality is Q = log{(N_together · N_total)/[(N_alone1 + 1) · (N_alone2 + 1)]}.
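The raw quality formula in the caption can be evaluated directly. The sketch below assumes a natural logarithm (the caption does not state the base) and takes N_total to be the total number of purifications in the dataset; both are assumptions, as are the function and parameter names.

```python
import math


def raw_quality(n_together, n_total, n_alone1, n_alone2):
    """Q = log{(N_together * N_total) / [(N_alone1 + 1) * (N_alone2 + 1)]}.

    Assumptions: natural logarithm; n_total is the total number of
    purifications; n_together must be positive for the log to be defined.
    """
    if n_together <= 0 or n_total <= 0:
        raise ValueError("n_together and n_total must be positive")
    return math.log((n_together * n_total) / ((n_alone1 + 1) * (n_alone2 + 1)))


# Two proteins seen together twice in 100 purifications, never alone,
# score higher than the same pair when each is also often seen alone.
q_clean = raw_quality(2, 100, 0, 0)
q_noisy = raw_quality(2, 100, 3, 4)
print(q_clean, q_noisy)
```

The "+ 1" terms keep the denominator non-zero when a protein is never observed alone.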

Another important source of protein association information is the published literature (18,19). We systematically extract associations from PubMed, by searching for recurrent co-mentioning of gene names in abstracts. This search relies on gene names and synonyms parsed from SwissProt as well as from organism-specific databases, and we utilize a benchmarked scoring system based on the frequencies and distributions of gene names in abstracts (not shown). Finally, we also derive protein–protein associations from functional genomics data: co-regulation of genes across diverse experimental conditions, as measured by using microarray analysis, can be a predictor of functional associations (20). We import these associations from the ArrayProspector server (12), which is based on the same benchmarks and genomes as STRING itself.
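As a rough illustration of co-mention counting, the sketch below counts the abstracts in which the names or synonyms of two genes both appear. It covers only the counting step: the benchmarked scoring system is not described in the text and is not reproduced here. The function name, the whole-word and case-insensitive matching, and the toy abstracts are assumptions made for the example.

```python
import re
from collections import Counter
from itertools import combinations


def comention_counts(abstracts, synonyms):
    """Count, for each gene pair, the abstracts that mention both genes.

    abstracts: iterable of abstract texts
    synonyms:  dict mapping a gene identifier -> list of names/synonyms
    """
    # Assumption: names are matched as whole words, ignoring case.
    patterns = {
        gene: re.compile(
            r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b",
            re.IGNORECASE,
        )
        for gene, names in synonyms.items()
    }
    counts = Counter()
    for text in abstracts:
        found = sorted(g for g, p in patterns.items() if p.search(text))
        counts.update(combinations(found, 2))  # each pair counted once per abstract
    return counts


# Invented toy abstracts; the synonym lists are illustrative only.
toy_synonyms = {"ATP1": ["ATP1", "Atp1p"], "ATP2": ["ATP2"], "QCR2": ["QCR2"]}
toy_abstracts = [
    "ATP1 and ATP2 form part of the ATP synthase.",
    "ATP2 binds QCR2.",
    "The beta subunit ATP2 co-purifies with Atp1p.",
]
counts = comention_counts(toy_abstracts, toy_synonyms)
print(counts)
```

Raw counts like these would still need to be normalized against how often each gene name occurs on its own before they could serve as a score.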

TRANSFER OF ASSOCIATIONS ACROSS ORGANISMS

STRING employs two different strategies for transferring known and predicted associations between organisms (Figure 3): the first (‘COG-mode’) relies on externally provided orthology assignments and transfers interactions in an all-or-none fashion, whereas the second (‘protein-mode’) uses quantitative sequence similarity searches and often distributes a given interaction fractionally among several protein pairs of the target organism. Both approaches have strengths and weaknesses, and users can choose either one of them before starting their query (a color change distinguishes the active mode throughout the user interface).

Figure 3

Transferring association scores between organisms. Initial situation (top): a scored association between two proteins in a source organism. How confidently can it be transferred to a target organism by a postulated association among homologous proteins? Bottom left: in ‘COG-mode’, all proteins in an orthologous group (COG) are considered equivalent. The highest association score between any two proteins in the two COGs is assumed to be valid for all pairs. Bottom right: in ‘protein-mode’, all sequence similarity relations between the two organisms are considered. Associations are transferred fractionally, such that the pair with the highest similarity receives the bulk of the score. The relation is not linear: empirical analysis (not shown) suggests that competing similarity links should be down-weighted, relative to the best link, as follows: (i) express similarities as values between zero and one, i.e. normalize by self-hit; (ii) transform similarities using s′ = exp(−k_1/s), thereby amplifying their ‘spread’; (iii) re-normalize so that, between the two species, all similarities for a protein family add up to one; (iv) each pair of proteins, A and B, in the target species now receives a share of the association score: S_target = S_source · k_2 · s′_A · s′_B. Optimal values for k_1 and k_2 were empirically found to be 0.7 for both.
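Steps (i) to (iv) of the caption can be sketched as follows. This is one possible reading, not the STRING implementation: it assumes that similarities are already normalized by self-hit, and that the re-normalization in step (iii) is applied separately to the candidate homologs of each source protein. All function and variable names are hypothetical.

```python
import math

K1 = 0.7  # empirical constant from the caption
K2 = 0.7  # empirical constant from the caption


def shares(similarities, k1=K1):
    """Steps (ii) and (iii): transform with exp(-k1/s), then normalize to sum 1.

    similarities: dict mapping candidate protein -> similarity in (0, 1]
    """
    transformed = {p: math.exp(-k1 / s) for p, s in similarities.items() if s > 0}
    total = sum(transformed.values())
    return {p: t / total for p, t in transformed.items()}


def transfer(s_source, sims_a, sims_b, k1=K1, k2=K2):
    """Step (iv): S_target = S_source * k2 * s'_A * s'_B for every target pair."""
    share_a = shares(sims_a, k1)
    share_b = shares(sims_b, k1)
    return {
        (a, b): s_source * k2 * wa * wb
        for a, wa in share_a.items()
        for b, wb in share_b.items()
    }


# One source association (score 0.9); the first partner has two competing
# homologs in the target organism, the second partner has a single one.
result = transfer(0.9, {"A1": 0.8, "A2": 0.4}, {"B1": 1.0})
print(result)
```

With similarities of 0.8 and 0.4, a linear split would give the better homolog two thirds of the score. The exponential transform gives it about 71%, which is the amplified ‘spread’ mentioned in step (ii).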

The COG mode requires an assignment of proteins into orthologous groups; all proteins within such a group are assumed to be functionally equivalent across genomes. This orthology information is imported from the COGs database [(21), we extend the groups to cover all organisms in STRING]. Any association score observed between a pair of proteins from two different COGs is assumed to be valid for all protein pairs spanning these two COGs. Repeated observations of links, e.g. occurrence of genes in the same operon, increase the association score—but only when they are observed in phylogenetically distant organisms. In the newly developed protein mode, there is no preassigned orthology information. Instead, the transfer relies on a precomputed all-against-all similarity search of the 730 000 proteins in STRING (using the sensitive Smith-Waterman algorithm). For each association to be transferred, the algorithm searches for potential orthologs of the interacting partners in other genomes. Orthology is assumed if proteins form reciprocal best matches in the searches, in the absence of any close, second-best hits (paralogs) in either species. In such an ideal situation, the interactions can be transferred in toto. However, in reality there will often be additional paralogs in one or both of the genomes, which complicates the transfer. We have devised and benchmarked an empirical scheme that is based on the relative sequence similarity of competing paralogous proteins (Figure 3). Essentially, the pair of proteins exhibiting the highest sequence similarity to the source pair receives the highest ‘share’ of the transferred interaction.
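For comparison with the fractional scheme, the all-or-none COG-mode transfer can be sketched in a few lines: the highest score seen between any two proteins of two different COGs is taken as the score for every protein pair spanning those COGs. The sketch deliberately leaves out the score increase for repeated observations in phylogenetically distant organisms. Function names and identifiers are hypothetical.

```python
def cog_level_scores(protein_scores, cog_of):
    """Collapse protein-pair scores into the best score per pair of COGs.

    protein_scores: dict mapping (protein_a, protein_b) -> association score
    cog_of:         dict mapping protein -> COG identifier
    """
    best = {}
    for (a, b), score in protein_scores.items():
        cog_a, cog_b = cog_of.get(a), cog_of.get(b)
        # Only pairs spanning two different, known COGs are considered.
        if cog_a is None or cog_b is None or cog_a == cog_b:
            continue
        key = tuple(sorted((cog_a, cog_b)))
        best[key] = max(best.get(key, 0.0), score)
    return best


def transferred_score(a, b, best, cog_of):
    """Score assumed for any protein pair spanning two COGs (0.0 if unknown)."""
    cog_a, cog_b = cog_of.get(a), cog_of.get(b)
    if cog_a is None or cog_b is None:
        return 0.0
    return best.get(tuple(sorted((cog_a, cog_b))), 0.0)


# Hypothetical proteins from a source (src_*) and a target (tgt_*) organism.
cog_of = {
    "src_a": "COG_A", "src_b": "COG_B",
    "tgt_a": "COG_A", "tgt_b": "COG_B", "tgt_b2": "COG_B",
}
protein_scores = {("src_a", "src_b"): 0.8, ("tgt_a", "tgt_b"): 0.4}
best = cog_level_scores(protein_scores, cog_of)
print(transferred_score("tgt_a", "tgt_b2", best, cog_of))
```

Note that tgt_a and tgt_b2 receive the full score even though no association was observed between them directly, which is the all-or-none behavior contrasted with protein-mode.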

INTEGRATION

After assignment of association scores and transfer between species, we compute a final ‘combined score’ between any pair of proteins (or pair of COGs). This score is often higher than the individual sub-scores, expressing increased confidence when an association is supported by several types of evidence (Table 1). It is computed under the assumption of independence for the various sources, in a naïve Bayesian fashion. It is thus a simple expression of the individual scores: S = 1 − Π_i (1 − S_i), where S_i is the score from evidence type i.

Table 1.

The number of associations stored in STRING, shown separately for each data source and confidence range (low confidence: scores <0.4; medium: 0.4 to 0.7; high: >0.7)

Association evidence type	Low confidence	Medium confidence	High confidence
Conserved neighborhood	441 385	111 785	32 401
Gene fusions	33 578	3765	4393
Phylogenetic co-occurrence	10 634 199	1 091 997	175 774
Co-expression	4 556 189	615 825	29 879
Database imports	38 638	11 193	7695
Large-scale experiments	207 829	26 226	3401
Literature co-occurrence	413 521	154 594	55 387
Combined score	15 498 524	1 937 091	368 669

The confidence increases when methods are combined (e.g. the last row contains more high-confidence links than the sum of the rows above it). Data from version 5.1 of STRING.

The assumption of independence is valid here because datasets that are based on similar technologies (e.g. different yeast two-hybrid datasets) have been joined previously and are benchmarked as a single information source. Along with the combined score, the individual sub-scores are always displayed as well, because they provide valuable information about the nature of a particular association.
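Combining independent sub-scores in a naïve Bayesian fashion can be sketched as below, assuming the form S = 1 − Π_i (1 − S_i): the association is taken to be false only if every evidence type is independently wrong. The function name is hypothetical, and `math.prod` requires Python 3.8 or later.

```python
from math import prod  # available from Python 3.8


def combined_score(sub_scores):
    """Combine independent confidence scores: S = 1 - prod(1 - S_i)."""
    for s in sub_scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
    return 1.0 - prod(1.0 - s for s in sub_scores)


# Two medium-confidence sub-scores yield a high-confidence combined score.
print(combined_score([0.6, 0.5]))
```

Here two sub-scores in the medium range (0.6 and 0.5) combine to roughly 0.8, which crosses the 0.7 threshold for high confidence used in Table 1.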

22 in total

1. Mining literature for protein-protein interactions.

Authors: E M Marcotte; I Xenarios; D Eisenberg
Journal: Bioinformatics Date: 2001-04 Impact factor: 6.937

2. STRING: a database of predicted functional associations between proteins.

Authors: Christian von Mering; Martijn Huynen; Daniel Jaeggi; Steffen Schmidt; Peer Bork; Berend Snel
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

3. The European Bioinformatics Institute's data resources.

Authors: Catherine Brooksbank; Evelyn Camon; Midori A Harris; Michele Magrane; Maria Jesus Martin; Nicola Mulder; Claire O'Donovan; Helen Parkinson; Mary Ann Tuli; Rolf Apweiler; Ewan Birney; Alvis Brazma; Kim Henrick; Rodrigo Lopez; Guenter Stoesser; Peter Stoehr; Graham Cameron
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

4. BIND: the Biomolecular Interaction Network Database.

Authors: Gary D Bader; Doron Betel; Christopher W V Hogue
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

5. A gene-coexpression network for global discovery of conserved genetic modules.

Authors: Joshua M Stuart; Eran Segal; Daphne Koller; Stuart K Kim
Journal: Science Date: 2003-08-21 Impact factor: 47.728

Review 6. Prediction of protein-protein interactions from evolutionary information.

Authors: Alfonso Valencia; Florencio Pazos
Journal: Methods Biochem Anal Date: 2003

Review 7. Function prediction and protein networks.

Authors: Martijn A Huynen; Berend Snel; Christian von Mering; Peer Bork
Journal: Curr Opin Cell Biol Date: 2003-04 Impact factor: 8.382

8. The Genome Knowledgebase: a resource for biologists and bioinformaticists.

Authors: G Joshi-Tope; I Vastrik; G R Gopinath; L Matthews; E Schmidt; M Gillespie; P D'Eustachio; B Jassal; S Lewis; G Wu; E Birney; L Stein
Journal: Cold Spring Harb Symp Quant Biol Date: 2003

9. Functional organization of the yeast proteome by systematic analysis of protein complexes.

Authors: Anne-Claude Gavin; Markus Bösche; Roland Krause; Paola Grandi; Martina Marzioch; Andreas Bauer; Jörg Schultz; Jens M Rick; Anne-Marie Michon; Cristina-Maria Cruciat; Marita Remor; Christian Höfert; Malgorzata Schelder; Miro Brajenovic; Heinz Ruffner; Alejandro Merino; Karin Klein; Manuela Hudak; David Dickson; Tatjana Rudi; Volker Gnau; Angela Bauch; Sonja Bastuck; Bettina Huhse; Christina Leutwein; Marie-Anne Heurtier; Richard R Copley; Angela Edelmann; Erich Querfurth; Vladimir Rybin; Gerard Drewes; Manfred Raida; Tewis Bouwmeester; Peer Bork; Bertrand Seraphin; Bernhard Kuster; Gitte Neubauer; Giulio Superti-Furga
Journal: Nature Date: 2002-01-10 Impact factor: 49.962

Review 10. MINT: a Molecular INTeraction database.

Authors: Andreas Zanzoni; Luisa Montecchi-Palazzi; Michele Quondam; Gabriele Ausiello; Manuela Helmer-Citterich; Gianni Cesareni
Journal: FEBS Lett Date: 2002-02-20 Impact factor: 4.124
