STRING 8--a global view on proteins and their functional interactions in 630 organisms.

Lars J Jensen, Michael Kuhn, Manuel Stark, Samuel Chaffron, Chris Creevey, Jean Muller, Tobias Doerks, Philippe Julien, Alexander Roth, Milan Simonovic, Peer Bork, Christian von Mering.

Abstract

Functional partnerships between proteins are at the core of complex cellular phenotypes, and the networks formed by interacting proteins provide researchers with crucial scaffolds for modeling, data reduction and annotation. STRING is a database and web resource dedicated to protein-protein interactions, including both physical and functional interactions. It weights and integrates information from numerous sources, including experimental repositories, computational prediction methods and public text collections, thus acting as a meta-database that maps all interaction evidence onto a common set of genomes and proteins. The most important new developments in STRING 8 over previous releases include a URL-based programming interface, which can be used to query STRING from other resources, improved interaction prediction via genomic neighborhood in prokaryotes, and the inclusion of protein structures. Version 8.0 of STRING covers about 2.5 million proteins from 630 organisms, providing the most comprehensive view on protein-protein interactions currently available. STRING can be reached at http://string-db.org/.

Year: 2008 PMID: 18940858 PMCID: PMC2686466 DOI: 10.1093/nar/gkn760

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

In contrast to genome sequences, which are quickly becoming a commodity, elucidating the functional connectivity within a proteome is a much more challenging problem. The various protein complexes, transient interactions and functional pathways are all context-dependent, and the experimental techniques for their elucidation are diverse, often not directly comparable, and less reliable than genome sequencing. Nevertheless, protein–protein interaction networks (also called ‘association networks’ when functional associations are included) are a crucial ingredient for any system-level understanding of cellular machineries (1–5). Furthermore, protein networks can serve very concrete, practical purposes such as filtering and assessing high-throughput functional genomics data, and providing intuitive visual scaffolds for annotating the structural, functional and evolutionary properties of proteins. The database and web-tool STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a meta-resource that aggregates most of the available information on protein–protein associations, scores and weights it, and augments it with predicted interactions, as well as with the results of automatic literature-mining searches. Since its first release in 2000 (6), it has grown into the most comprehensive resource of its type. It builds upon and extends the excellent manual annotation efforts undertaken at primary protein interaction databases (7–12) and at databases of curated pathway knowledge (13–15). Here, we describe new features that have been added since our report on the previous release, STRING 7 (16).

EXTENDING THE SOURCES OF INTERACTION INFORMATION

The basic interaction unit in STRING is the ‘functional association’, which is defined in this database as the specific and meaningful interaction between two proteins that jointly contribute to the same functional process. With respect to the interacting proteins, STRING does not consider any specific splicing isoforms or posttranslational modifications, but instead represents each protein-coding locus in a genome by a single protein (the longest isoform). Thus, and because STRING aggregates data and predictions stemming from a wide spectrum of cell types and environmental conditions, it aims to represent the union of all possible protein–protein links. From this union, the actual network for any given spatio-temporal snapshot of the cell can in principle be deduced by projection, for example by removing proteins known not to be expressed or active under the conditions studied (17). In keeping with the above definitions, STRING imports protein association knowledge not only from databases of physical interactions, but also from databases of curated biological pathway knowledge. Apart from the resources already included in the previous release [MINT (10), HPRD (9), BIND (12), DIP (11), BioGRID (8), KEGG (13) and Reactome (14)], a number of resources have been newly included [IntAct (7), EcoCyc (15), NCI-Nature Pathway Interaction Database and Gene Ontology (GO) protein complexes]. For the full STRING release, this set of previously known and well-described interactions is then complemented by interactions that are predicted computationally, specifically for STRING, using a number of prediction algorithms (18,19). First, we conduct systematic searches for genes that are found in close proximity within prokaryotic chromosomes, which is a good indicator for functional linkage. Second, we search for instances where genes have joined to encode a single fusion protein, which is indicative of functional linkage even in organisms where the two proteins have not fused.
Third, we search for gene families that share above-random similarities in their evolutionary histories (i.e. they have similar ‘phylogenetic profiles’). This, again, predicts that they contribute to similar functional processes in the cell. Fourth, we conduct searches for genes that display a similar transcriptional response across a variety of conditions (co-expression). Individually, the above predictors may not always have the specificity of direct experimental interaction assays; however, when used in concert and integrated probabilistically, the performance even of relatively weak predictors can rival that of experimental data (20). Lastly, two further sources of interactions in STRING are actually providing the majority of associations; these are text-mining and interaction transfer between organisms. For the former, we parse a large body of scientific texts [SGD (21), OMIM (22), The Interactive Fly, and all abstracts from PubMed]. We search for statistically relevant co-occurrences of gene names, and also extract a subset of semantically specified interactions using Natural Language Processing (23). For the transfer of interactions between organisms, we estimate whether a pair of interacting proteins found conserved in another organism justifies the transfer of the interaction to that other organism (24). The transferred interactions, as well as all predicted or imported interactions, are benchmarked and scored against a common reference of functional partnership [we currently use the joint membership of proteins in biological pathways, as annotated at KEGG (13), as our gold-standard]. Together, the above sources of interactions, including predictions and transfers, result in a uniquely high coverage of the interaction networks stored in STRING (Figure 1), particularly for well-studied model organisms. Since the previous release, STRING has almost doubled the number of supported organisms, which now stands at 630. 
The number of stored interactions has increased as well, to a total of more than 50 million. Since the various subtypes of the interaction evidence are stored separately in the database, they can be disabled at will—giving users the ability to adjust the scope and specificity of STRING towards their particular application.
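The probabilistic integration of evidence described above can be illustrated with a minimal sketch. It assumes that every evidence channel yields an independent probability that an association is real, and it combines these as one minus the product of the complements. Both the independence assumption and the name `combine_scores` are illustrative and not taken from this text, so the scoring scheme actually used in STRING may differ in detail. Disabling an evidence type is modelled by leaving its score out of the product.

```python
from math import prod

def combine_scores(channel_scores, disabled=()):
    """Combine per-channel probabilities, ASSUMING the channels are independent."""
    # Drop the evidence channels that the user has switched off.
    active = [s for name, s in channel_scores.items() if name not in disabled]
    for s in active:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must be probabilities in [0, 1]")
    # Probability that at least one channel is right: 1 - prod(1 - s_i).
    return 1.0 - prod(1.0 - s for s in active)

# Made-up scores for one hypothetical protein pair.
scores = {"neighborhood": 0.4, "coexpression": 0.3, "textmining": 0.5}
combined = combine_scores(scores)
without_textmining = combine_scores(scores, disabled={"textmining"})
```

Under this assumption, three individually weak channels already add up to a markedly higher combined score than any single one of them, which is the effect the text attributes to using the predictors in concert.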

Figure 1.

Protein association network in STRING. An example of the network view in STRING, centered on the query protein ‘hisB’ from Escherichia coli. The inset shows the annotations and options that are available for each protein, including references to other databases. Three ‘functional modules’ can readily be seen in the network, forming tightly connected clusters. These encompass histidine biosynthesis, branched-chain amino acid biosynthesis, and—less strongly connected—a part of fatty acid biosynthesis. Line color indicates the type of the supporting evidence; all underlying evidence can be inspected in dedicated viewers that are accessible from the network.

EXTENDED DEFINITION OF CONSERVED GENOMIC NEIGHBORHOOD

When working with prokaryotes, scientists have long used conserved genomic neighborhood arrangements of genes to infer functional linkage, assuming that such arrangements reflect polycistronic transcription units (operons). STRING has followed this principle, compiling and benchmarking protein–protein associations based on close, co-directional neighborhood of genes on the genome. As of version 8, this has been extended to cover also neighboring genes that are counter-directional in a head-to-head orientation (‘divergent transcription’). Such divergently oriented gene pairs have been shown to be indicative of functional linkage as well (25), albeit with somewhat lower confidence. Often, one of the two genes is a transcriptional regulator, targeting the neighboring gene (25). STRING now uses this type of arrangement in its neighborhood algorithm as well (benchmarked separately, Figure 2). In addition, STRING is now more error tolerant when assembling conserved neighborhoods, ignoring short, partially overlapping genes on the antisense strand that are likely to be spurious predictions.
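The distinction between co-directional and divergently transcribed neighbors can be sketched as follows. The `Gene` record, the strand encoding ("+" and "-") and the function names are hypothetical and chosen only for illustration. The sketch does not reproduce the conservation analysis across genomes or the filtering of spurious antisense genes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # leftmost coordinate on the chromosome
    end: int     # rightmost coordinate on the chromosome
    strand: str  # "+" or "-"

def classify_pair(a, b):
    """Classify two neighboring genes by their relative orientation."""
    left, right = sorted((a, b), key=lambda g: g.start)
    if left.strand == right.strand:
        return "co-directional"
    if left.strand == "-" and right.strand == "+":
        # Head-to-head: both genes are transcribed away from the shared intergenic region.
        return "divergent"
    return "convergent"  # tail-to-tail

def intergenic_distance(a, b):
    """Distance between the two genes; negative if they overlap."""
    left, right = sorted((a, b), key=lambda g: g.start)
    return right.start - left.end

# Hypothetical regulator next to a hypothetical target operon gene.
regulator = Gene("regA", 100, 900, "-")
target = Gene("opB", 1100, 2000, "+")
orientation = classify_pair(regulator, target)
distance = intergenic_distance(regulator, target)
```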

Figure 2.

Extended definition of genomic neighborhood. (A) Illustration of a conserved gene neighborhood, containing genes related to the biosynthesis and consumption of tryptophan (simplified from a STRING screenshot). Genes connected by lines are direct neighbors on the chromosome, and genes with similar colors are orthologs across the various organisms. The arrow marks a switch in gene orientation, leading to a head-to-head orientation of two presumptive operons. (B) Divergently oriented genes predict functional linkage in prokaryotes. Each dot summarizes a group (bin) of gene pairs with similar intergenic distances. The fraction of such pairs where both genes are annotated in the same KEGG pathway is indicated, implying functional partnership. Note that divergent gene pairs are slightly shifted towards larger intergenic distances, presumably to accommodate promoters and regulatory sequences.
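The benchmark summarized in panel (B) can be approximated with a short sketch: gene pairs are grouped by intergenic distance, and for each group the fraction of pairs annotated in the same KEGG pathway is reported. Fixed-width bins, the `bin_width` parameter and the toy data are assumptions made for illustration, since the text only states that each dot summarizes pairs with similar intergenic distances.

```python
from collections import defaultdict

def fraction_same_pathway(pairs, bin_width=50):
    """pairs: iterable of (intergenic_distance, same_pathway) tuples."""
    counts = defaultdict(lambda: [0, 0])  # bin start -> [pairs in same pathway, all pairs]
    for distance, same_pathway in pairs:
        bin_start = (distance // bin_width) * bin_width
        counts[bin_start][0] += int(same_pathway)
        counts[bin_start][1] += 1
    # Report the bins in order of increasing distance.
    return {b: hits / total for b, (hits, total) in sorted(counts.items())}

# Made-up gene pairs: (intergenic distance in bp, both genes in the same pathway?)
toy_pairs = [(10, True), (30, True), (45, False), (60, True), (90, False), (120, False)]
fractions = fraction_same_pathway(toy_pairs)
```

Running the same function separately on co-directional and divergent pairs would give two curves that can be compared, in the spirit of the separate benchmarking mentioned in the text.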

INTEGRATION OF PROTEIN STRUCTURES

For each update, STRING now parses all entries of the PDB database of protein structures (26). The use of protein structures is two-fold: first, to inform the user that a given protein—or a close homolog thereof—indeed has 3D structure information. In this case, a small preview of a representative structure is shown in the network, and the user can follow it to view the full structure and to proceed to the PDB website. Second, protein structures serve as interaction evidence themselves, when more than one distinct peptide chain is found in the structure. In this case, a stable and reliable protein–protein interaction is assumed.
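The second use of structures can be sketched as a simple pairing step. The sketch assumes that each chain of a structure has already been mapped to a protein identifier, and it reads "distinct peptide chain" as "chains that map to different proteins". Both points, as well as all identifiers below, are illustrative assumptions rather than details given in the text.

```python
from itertools import combinations

def pairs_from_structures(structure_chains):
    """structure_chains: dict mapping a structure ID to the protein ID of each chain."""
    evidence = {}
    for structure_id, chain_proteins in structure_chains.items():
        distinct = sorted(set(chain_proteins))
        # Every pair of distinct proteins in one structure counts as interaction evidence.
        for pair in combinations(distinct, 2):
            evidence.setdefault(pair, []).append(structure_id)
    return evidence

# Made-up structures and protein identifiers.
toy = {
    "struct1": ["protA", "protB"],           # two distinct chains: one pair
    "struct2": ["protA", "protA"],           # same protein twice: no pair
    "struct3": ["protA", "protB", "protC"],  # three distinct chains: three pairs
}
evidence = pairs_from_structures(toy)
```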

NEW PROGRAMMING INTERFACE

To facilitate the integration of STRING into network tools like Cytoscape (27) and workflow engines like Taverna (28), we have created an application programming interface (API) that allows access to the interaction network in computer-readable formats (Figure 3). Additionally, specific API functions allow retrieval of individual records from our database, for example to map a protein via its name onto a STRING entry. We further envision that the STRING API will be useful to developers of web services who plan to make use of the STRING interaction network. If a particular web service needs access to the complete set of interactions, it may still be advisable to maintain a local copy of our data distribution. However, if the service requires access to many different subsets (depending on user input), querying STRING via its API could reduce administrative load.

Figure 3.

The new Application Programming Interface, and how it connects to Cytoscape. Specific items of interest can be retrieved from STRING by constructing URLs accordingly (see Table). Unless STRING's internal identifiers are known, an initial call with the ‘resolve’ request is recommended, to map query items to nodes in the STRING network. TSV, tab-separated values; JSON, JavaScript Object Notation; PSI-MI 2.5, Proteomics Standards Initiative Molecular Interaction (XML and tab-delimited format). *Requests ending in ‘List’ accept more than one input item, but are otherwise identical (multiple query items must be separated by URL-encoded ‘new-line’ characters).
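A URL for such a request could be assembled along the following lines. Only the host name, the ‘resolve’ request, the ‘List’ suffix and the newline separator come from the text. The `/api/<format>/<request>` path layout, the format tokens `json` and `tsv`, and the parameter names `identifier` and `identifiers` are assumptions made for this sketch and should be checked against the API documentation. The code only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

BASE = "http://string-db.org/api"  # host from the text; the "/api" path is an ASSUMPTION

def build_request_url(fmt, request, items):
    """Build a query URL; the 'List' variant of the request is used for several items."""
    items = list(items)
    if not items:
        raise ValueError("at least one query item is required")
    if len(items) == 1:
        query = urlencode({"identifier": items[0]})   # parameter name is an assumption
    else:
        request += "List"
        # Items are joined by new-line characters, which urlencode turns into %0A.
        query = urlencode({"identifiers": "\n".join(items)})
    return f"{BASE}/{fmt}/{request}?{query}"

single = build_request_url("json", "resolve", ["hisB"])
several = build_request_url("tsv", "resolve", ["hisB", "hisA"])
```

Following the recommendation in the caption, a client would first issue a ‘resolve’ request like this and then use the returned identifiers in subsequent network queries.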

USE SCENARIOS

Apart from the ad hoc and barrier-free access through the website, STRING can be downloaded and used locally, either in the form of concise flat-files or as a mirror installation of the complete relational database back-end (some of the downloads do require a free, nonredistribution license applicable to academic nonprofit users). The interacting entities in STRING can be set to be either proteins, or groups of orthologs spanning multiple organisms (‘COG-mode’). For the latter, STRING relies on an updated and extended version of the COGs [‘Clusters of Orthologous Groups’ (29)], which is being maintained at the eggNOG database (30). A variety of other databases use STRING networks as a basis for further computations/annotations, for example by augmenting the networks with small molecules [STITCH, (31)], or by using the network to increase the power of kinase–substrate predictions [NetworKIN, (32)]. STRING has also been integrated into third-party tools such as NeAT [Network Analysis Tools, (33)], which provides various ways to analyze the interaction network, or Gaggle (34), which enables automated data transfer into other tools via a browser add-on.

FUNDING

Swiss Institute of Bioinformatics; University of Zurich through its Research Priority Program ‘Systems Biology and Functional Genomics’; European Commission's FP6 Programme through the ADIT Integrated Project (LSHB-CT-2005-511065); BioSapiens Network of Excellence (LSHG-CT-2003-503265). Funding for open access charge: University of Zurich.

33 in total

1. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene.

Authors: B Snel; G Lehmann; P Bork; M A Huynen
Journal: Nucleic Acids Res Date: 2000-09-15 Impact factor: 16.971

2. A Bayesian networks approach for predicting protein-protein interactions from genomic data.

Authors: Ronald Jansen; Haiyuan Yu; Dov Greenbaum; Yuval Kluger; Nevan J Krogan; Sambath Chung; Andrew Emili; Michael Snyder; Jack F Greenblatt; Mark Gerstein
Journal: Science Date: 2003-10-17 Impact factor: 47.728

3. The Database of Interacting Proteins: 2004 update.

Authors: Lukasz Salwinski; Christopher S Miller; Adam J Smith; Frank K Pettit; James U Bowie; David Eisenberg
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

4. Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal: Genome Res Date: 2003-11 Impact factor: 9.043

5. Taverna: a tool for the composition and enactment of bioinformatics workflows.

Authors: Tom Oinn; Matthew Addis; Justin Ferris; Darren Marvin; Martin Senger; Mark Greenwood; Tim Carver; Kevin Glover; Matthew R Pocock; Anil Wipat; Peter Li
Journal: Bioinformatics Date: 2004-06-16 Impact factor: 6.937

6. Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs.

Authors: Jan O Korbel; Lars J Jensen; Christian von Mering; Peer Bork
Journal: Nat Biotechnol Date: 2004-07 Impact factor: 54.908

7. STRING: known and predicted protein-protein associations, integrated and transferred across organisms.

Authors: Christian von Mering; Lars J Jensen; Berend Snel; Sean D Hooper; Markus Krupp; Mathilde Foglierini; Nelly Jouffre; Martijn A Huynen; Peer Bork
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

8. EcoCyc: a comprehensive database resource for Escherichia coli.

Authors: Ingrid M Keseler; Julio Collado-Vides; Socorro Gama-Castro; John Ingraham; Suzanne Paley; Ian T Paulsen; Martín Peralta-Gil; Peter D Karp
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

9. The Biomolecular Interaction Network Database and related tools 2005 update.

Authors: C Alfarano; C E Andrade; K Anthony; N Bahroos; M Bajec; K Bantoft; D Betel; B Bobechko; K Boutilier; E Burgess; K Buzadzija; R Cavero; C D'Abreo; I Donaldson; D Dorairajoo; M J Dumontier; M R Dumontier; V Earles; R Farrall; H Feldman; E Garderman; Y Gong; R Gonzaga; V Grytsan; E Gryz; V Gu; E Haldorsen; A Halupa; R Haw; A Hrvojic; L Hurrell; R Isserlin; F Jack; F Juma; A Khan; T Kon; S Konopinsky; V Le; E Lee; S Ling; M Magidin; J Moniakis; J Montojo; S Moore; B Muskat; I Ng; J P Paraiso; B Parker; G Pintilie; R Pirone; J J Salama; S Sgro; T Shan; Y Shu; J Siew; D Skinner; K Snyder; R Stasiuk; D Strumpf; B Tuekam; S Tao; Z Wang; M White; R Willis; C Wolting; S Wong; A Wrong; C Xin; R Yao; B Yates; S Zhang; K Zheng; T Pawson; B F F Ouellette; C W V Hogue
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. NeAT: a toolbox for the analysis of biological networks, clusters, classes and pathways.

Authors: Sylvain Brohée; Karoline Faust; Gipsi Lima-Mendez; Olivier Sand; Rekin's Janky; Gilles Vanderstocken; Yves Deville; Jacques van Helden
Journal: Nucleic Acids Res Date: 2008-06-04 Impact factor: 16.971
