
The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible.

Damian Szklarczyk¹, John H Morris², Helen Cook³, Michael Kuhn⁴, Stefan Wyder¹, Milan Simonovic¹, Alberto Santos³, Nadezhda T Doncheva³, Alexander Roth¹, Peer Bork⁵⁻⁸, Lars J Jensen⁹, Christian von Mering¹⁰.

Abstract

A system-wide understanding of cellular function requires knowledge of all functional interactions between the expressed proteins. The STRING database aims to collect and integrate this information, by consolidating known and predicted protein-protein association data for a large number of organisms. The associations in STRING include direct (physical) interactions, as well as indirect (functional) interactions, as long as both are specific and biologically meaningful. Apart from collecting and reassessing available experimental data on protein-protein interactions, and importing known pathways and protein complexes from curated databases, interaction predictions are derived from the following sources: (i) systematic co-expression analysis, (ii) detection of shared selective signals across genomes, (iii) automated text-mining of the scientific literature and (iv) computational transfer of interaction knowledge between organisms based on gene orthology. In the latest version 10.5 of STRING, the biggest changes are concerned with data dissemination: the web frontend has been completely redesigned to reduce dependency on outdated browser technologies, and the database can now also be queried from inside the popular Cytoscape software framework. Further improvements include automated background analysis of user inputs for functional enrichments, and streamlined download options. The STRING resource is available online, at http://string-db.org/.

Substances: Proteins

Year: 2016 PMID: 27924014 PMCID: PMC5210637 DOI: 10.1093/nar/gkw937

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The flow of information and energy through the cell proceeds along specific and evolved interfaces: across and between nucleotides, proteins, lipids, metabolites and other small molecules. Among these interfaces, those between proteins are arguably among the most important, being biochemically diverse and information-rich, and showing exquisite specificity (1–3). Apart from direct physical binding, proteins also have many other, indirect ways of cooperation and mutual regulation: they can influence each other's production and half-life transcriptionally and post-transcriptionally, exchange reaction products, participate in signal relay mechanisms, or jointly contribute toward specific organismal functions. Together, these direct and indirect interactions constitute ‘functional association’, a useful operational umbrella-term for specific and functionally productive interactions of any type (4–9). Assembling all known and predicted protein functional associations for a given organism results in a protein network of genome-wide functional connectivity. These networks represent a crucial, intermediate level of information aggregation: they are placed between pathway databases at one extreme (which provide mechanistic detail but often have low coverage), and high-throughput experimental interaction discovery and ad hoc predictions at the other extreme (which have high coverage but usually also high levels of false positives). As such, protein networks are ideally suited to serve as scaffolds or filters for further data integration, for visualization and for molecular discovery. 
They are essential for modern life sciences: protein networks are used to increase discovery power for noisy data sets by ‘network smoothing’ (10,11), help define drug efficiency by network-based ‘drug-disease proximity measures’ (12), help to interpret the results of genome-wide association screens (13–17) and enable the discovery of new molecular players through the ‘guilt by association’ concept (18,19). A number of databases and online resources are dedicated to protein networks, at various levels of abstraction and each with a somewhat different focus/scope. First, individual well-supported protein–protein interactions are curated manually from the published literature, through dedicated efforts by members of the IMEx consortium (20,21), but also as part of more general annotation workflows such as within the UniProt consortium (22). Second, a number of databases assemble larger, genome-wide protein networks that are nevertheless still restricted to experimentally observed interactions only; examples include BioGRID (23), HINT (24), iRefWeb (25) and APID (26). Lastly, resources such as STRING include indirect and predicted interactions on top, aiming for inclusiveness in scope and for maximal coverage. Apart from STRING, this latter group includes GeneMANIA (27), Integrated Multi-species Prediction (28), Integrated Interactions Database (29), HumanNet (17), FunCoup (30) and others. For this group of data resources, it is particularly important to provide interaction weights (such as quality scores or confidence estimates), to allow the users to prune down these inclusive networks, as needed. 
Within the spectrum of the above resources, STRING aims to set itself apart in three ways: (i) comprehensiveness – it covers the largest number of organisms and uses the widest breadth of input sources, including automated text-mining and computational predictions, (ii) usability – in terms of an intuitive web interface, Cytoscape integration and programmatic access options, and (iii) quality control and traceability – each interaction is annotated with benchmarked confidence scores, separately per evidence type, and the underlying evidence can be tracked to its source. STRING has been maintained continuously since the year 2000, and has already been described in several publications (31–34). Below, we provide a brief overview of the main features, and describe recent technical developments.

DATABASE CONTENT

For each protein–protein association stored in STRING, a score is provided. These scores (i.e., the ‘edge weights’ in each network) are confidence scores, scaled between zero and one. They indicate the estimated likelihood that a given interaction is biologically meaningful, specific and reproducible, given the supporting evidence. For each interaction, the supporting evidence is divided into one or more ‘evidence channels’, depending on the origin and type of the evidence. There are seven channels, and they are assembled, scored and benchmarked separately. In the network visualization on the web frontend, the evidence channels are usually delineated by edges of different color, and each of the channels can be disabled individually by the user, in case some types of evidence are not considered suitable for the particular question being studied. Based on the seven channels, a combined and final confidence score is computed for each interaction, and it is this ‘combined score’ that is typically used as the final measure when building networks or when sorting and filtering interactions. For a given interaction, it is generally a good sign of support when not only the combined score is high, but more than one evidence channel also contributes to the score. Furthermore, it is important to note that the interactions in STRING have gene-locus resolution only: we do not discriminate between different splice isoforms or post-translationally modified forms. Hence, the interacting units in STRING are actually the protein-coding gene loci (represented by their main, canonical protein isoform).

Briefly, the seven evidence channels in STRING are:

(i) The experiments channel: Here, evidence comes from actual experiments in the lab (including biochemical, biophysical, as well as genetic experiments). This channel is populated mainly from the primary interaction databases organized in the IMEx consortium, plus BioGRID.

(ii) The database channel: In this channel, STRING collects evidence that has been asserted by a human expert curator; this information is imported from pathway databases.

(iii) The textmining channel: Here, STRING searches for mentions of protein names in all PubMed abstracts, in an in-house collection of more than three million fulltext articles, and in other text collections (35,36). Pairs of proteins are given an association score when they are frequently mentioned together in the same paper, abstract or even sentence (relative to how often they are mentioned separately). This score is raised further when it has been possible to parse one or more sentences through Natural Language Processing, and a concept connecting the two proteins was encountered (such as ‘binding’ or ‘phosphorylation by’).

(iv) The coexpression channel: For this channel, gene expression data originating from a variety of expression experiments are normalized, pruned and then correlated (34). Pairs of proteins that are consistently similar in their expression patterns, under a variety of conditions, will receive a high association score. In addition to large-scale microarray data, RNAseq expression data are now also processed in version 10.5 of STRING; this results in the inclusion of 16 previously non-covered organisms in this channel.

(v) The neighborhood channel: This channel, and the next two, are genome-based prediction channels, whose functionality is generally most relevant for Bacteria and Archaea. In the neighborhood channel, genes are given an association score when they are consistently observed in each other's genome neighborhood (such as in the case of conserved, co-transcribed ‘operons’).

(vi) The fusion channel: Pairs of proteins are given an association score when there is at least one organism where their respective orthologs have fused into a single, protein-coding gene.

(vii) The co-occurrence channel: In this channel, STRING evaluates the phylogenetic distribution of orthologs of all proteins in a given organism. If two proteins show a high similarity in this distribution, i.e. if their orthologs tend to be observed as ‘present’ or ‘absent’ in the same subsets of organisms, then an association score is assigned. For this channel, the details of the STRING implementation have recently been described separately (37).

Apart from direct evidence collected in the seven evidence channels, another important contribution of interactions in STRING comes from the transfer of evidence from one organism to another. This so-called ‘interolog’ transfer (38,39) is based on the observation that orthologs of interacting proteins in one organism are often also interacting in another organism; the better the orthology relationships can be established, the more confident this inference is. STRING relies on hierarchical orthology relations imported from the eggNOG database (40), and conducts an all-against-all transfer of interactions, benchmarked separately for each evidence channel. Transfers between closely related organisms are made more confidently, whereas the existence of paralogs (i.e., implied gene duplications) will lower the transfer score. Overall, the biggest benefit of the transfers can be seen for poorly studied organisms, where the fraction of interactions supported by transfers only can be as high as 99%. In contrast, in well-studied model organisms such as Escherichia coli, the corresponding fraction is below 20%.
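The text above states that a combined score is computed from the seven channels but does not give the formula. The following Python sketch shows one plausible way to do this, under two assumptions that are not taken from this article: the channels are treated as independent evidence and merged in a noisy-OR fashion, and a prior probability of random association (here the hypothetical constant `PRIOR = 0.041`) is removed from each channel before merging and added back once afterwards. The function name and channel keys are illustrative.

```python
PRIOR = 0.041  # assumed prior probability of a random association (not stated in the text)


def combined_score(channel_scores, prior=PRIOR):
    """Merge per-channel confidence scores (each in [0, 1]) into one score.

    Assumption: channels are independent, so the probability that no
    channel supports the association is the product of (1 - score).
    """
    no_evidence = 1.0
    for score in channel_scores.values():
        # Remove the prior; a score at or below the prior carries no evidence.
        adjusted = max(score - prior, 0.0) / (1.0 - prior)
        no_evidence *= 1.0 - adjusted
    merged = 1.0 - no_evidence
    # Add the prior back exactly once.
    return merged + prior * (1.0 - merged)


if __name__ == "__main__":
    single = combined_score({"experiments": 0.5})
    double = combined_score({"experiments": 0.6, "textmining": 0.6})
    print(f"one channel:  {single:.3f}")
    print(f"two channels: {double:.3f}")
```

With this scheme a single channel reproduces its own score, and every additional supporting channel raises the result, which matches the remark that multi-channel support is a good sign.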

USER INTERFACE

The protein networks stored in STRING can be accessed in a number of ways. Programmatic access is provided via a REST-API (41), via an R/Bioconductor package (34) and via a mechanism to add user-provided interactions, as well as protein-centric information, onto the website (‘data payload’) (32). Studies that require genome-wide networks can refer to the STRING download pages, where the complete interaction scores, as well as accessory information, are available (the downloads are free for academics; commercial users need a license for some of the files). As of version 10.5, the downloads can be pruned down by organism (or by groups of organisms) before the files are received, which greatly facilitates subsequent data processing. The most important interface to STRING, however, remains the web frontend (Figure 1). In 2016, it was completely redesigned from the ground up in order to remove dependencies on deprecated web technologies such as Adobe Flash. The new website allows easier and more intuitive browsing of the networks and the underlying evidence, and it is tightly integrated with the database backend to provide speedy responses. Users can make search results and gene sets persistent by logging in, and stable URLs are provided on each page to facilitate sharing of results.
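As an illustration of programmatic access, the Python sketch below builds a request URL for a network query and parses a tab-separated response. The URL layout (`/api/<format>/<method>`), the `network` method and the parameter names `identifiers`, `species` and `required_score` (on a 0 to 1000 scale) are assumptions based on the API as documented for recent STRING releases; they are not described in this article and may differ for version 10.5. The helper names are hypothetical.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

STRING_API = "https://string-db.org/api"  # assumed base URL of the REST API


def network_url(identifiers, species, required_score=400, fmt="tsv"):
    """Build a URL that requests the network among the given proteins."""
    params = {
        "identifiers": "\r".join(identifiers),  # one identifier per line
        "species": species,                     # NCBI taxonomy ID, e.g. 9606 for human
        "required_score": required_score,       # assumed scale: 0-1000
    }
    return f"{STRING_API}/{fmt}/network?{urlencode(params)}"


def parse_tsv(text):
    """Turn a tab-separated response with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def fetch_network(identifiers, species, required_score=400):
    """Download and parse a network (performs a real HTTP request)."""
    url = network_url(identifiers, species, required_score)
    with urlopen(url) as response:
        return parse_tsv(response.read().decode("utf-8"))


if __name__ == "__main__":
    print(network_url(["TP53", "MDM2"], species=9606))
```

Keeping URL construction separate from the download makes the request easy to inspect or log before any network traffic happens.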

Figure 1.

Network and Enrichment Analysis. Combined screenshots from the STRING website, showing results obtained upon entering a set of 31 proteins suspected to be involved in Amyotrophic Lateral Sclerosis (55). The insets show (from top to bottom): the accessory information available for a single protein, a reported enrichment of functional connections among the set of proteins, and statistical enrichments detected in functional subsystems. In the bottom inset, one enriched function has been selected, and the corresponding protein nodes in the network are automatically highlighted in color.

Importantly, users are now provided, by default, with statistical analysis results for each network. The analysis is done server-side, in the background, so as not to slow down the user experience, and it produces alerts when a network is enriched in certain known functions, or has more interactions (edges) than expected. This is particularly meaningful when users arrive at the website with a set of proteins instead of just a single query protein, as it provides a functional characterization of the set (this feature is increasingly used by STRING users). The enrichment tests are done for a variety of classification systems (Gene Ontology, KEGG, Pfam and InterPro), and employ a Fisher's exact test followed by a correction for multiple testing (42,43).
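The enrichment procedure described above can be sketched in Python as follows. The one-sided Fisher's exact test is written as a hypergeometric tail probability. The multiple-testing step uses the Benjamini-Hochberg procedure, which is an assumption: the text only says that a correction is applied. Function and argument names are illustrative.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    N: proteins in the background, K: background proteins with the annotation,
    n: proteins in the query set, k: query proteins with the annotation.
    Returns P(X >= k) for X drawn from the hypergeometric distribution.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical counts: 31 query proteins, three annotation terms.
    raw = [fisher_enrichment_p(k, 31, K, 20000) for k, K in [(8, 150), (2, 400), (1, 900)]]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"p = {p:.3g}, adjusted = {q:.3g}")
```

In practice the same test is repeated for every term of every classification system, so the correction is applied across all terms tested for one query set.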

CYTOSCAPE APP INTEGRATION

The web interface of STRING is designed primarily for users interested in small- to medium-scale networks, whereas the API, R package and download files are mainly intended for bioinformaticians who want to integrate STRING with other resources or perform large-scale network analyses. To bridge the gap between the two, we have developed a so-called App for the Cytoscape software framework (44,45), which allows users to easily retrieve, visualize and analyze networks of hundreds to thousands of proteins via a GUI. The App allows users to query STRING in three different ways from within Cytoscape: by protein names, by disease or by PubMed query. The first of these mirrors the ‘Multiple proteins’ query in the STRING web interface and allows users to retrieve a network for a list of up to 2000 protein names or identifiers from, for example, a proteomics or transcriptomics study. The second option is to retrieve a network for a disease of interest; it first retrieves a list of the top-N human proteins associated with the disease from the DISEASES database (46) and subsequently loads the STRING network for these proteins into Cytoscape. The third option, PubMed query, allows users to retrieve a STRING network pertaining to any topic of interest based on text mining of PubMed abstracts. The app fetches the abstracts for a user-specified query via NCBI E-utilities, counts how many of these mention each protein from the organism of interest, ranks the proteins by comparing these counts to precomputed background counts over entire PubMed and retrieves a STRING network for the top-N proteins. The underlying text mining is performed by the software also used for the text-mining channel in STRING. When a network is retrieved by the App, it comes associated with a large number of node attributes for each protein and edge attributes for each interaction, which can subsequently be used within Cytoscape. 
These include STRING and UniProt accessions to facilitate cross-linking with other resources, a human-readable name for display purposes and the protein sequence. If a protein was retrieved through a protein name query, we also store the exact query term with which the protein was found. This is helpful when querying for proteins identified in a proteomics or transcriptomics study, since it facilitates subsequent import of tabular data from the study (Figure 2). If available for the organism in question, the App also fetches information on the subcellular localization and tissue expression of each protein from the COMPARTMENTS (47) and TISSUES (48) databases as well as drug target information from Pharos (http://pharos.nih.gov/). For each interaction, the edge attributes include the overall confidence score and the subscores from each individual evidence channel.
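The ranking step of the PubMed query option (comparing per-protein mention counts in the retrieved abstracts with background counts over all of PubMed) can be sketched in Python as below. The article does not specify the scoring function, so the smoothed observed-to-expected ratio used here is an assumption, and all names and example counts are hypothetical.

```python
def rank_proteins(query_counts, background_counts, n_query, n_background,
                  top_n=10, pseudocount=1.0):
    """Rank proteins by how over-represented they are in the query abstracts.

    query_counts: protein -> number of retrieved abstracts mentioning it
    background_counts: protein -> number of all abstracts mentioning it
    n_query, n_background: sizes of the two abstract collections
    """
    scores = {}
    for protein, hits in query_counts.items():
        background_hits = background_counts.get(protein, 0)
        # Pseudocounts keep the ratio finite for rarely mentioned proteins.
        observed = (hits + pseudocount) / (n_query + pseudocount)
        expected = (background_hits + pseudocount) / (n_background + pseudocount)
        scores[protein] = observed / expected
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    query = {"SOD1": 40, "TP53": 40, "ACTB": 5}
    background = {"SOD1": 2000, "TP53": 90000, "ACTB": 50000}
    print(rank_proteins(query, background, n_query=100, n_background=1_000_000, top_n=2))
```

Normalizing by the background is what prevents generally well-studied proteins from dominating every topic-specific network.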

Figure 2.

STRING network visualization within Cytoscape. Using the Cytoscape STRING app, a network was retrieved for 78 proteins interacting with TrkA (tropomyosin-related kinase A) 10 min after stimulating neuroblastoma cells with NGF (nerve growth factor) (56). With a confidence cutoff of 0.4, the resulting network contains 182 functional associations between 57 of the proteins; the 21 proteins with no associations to other proteins in the network were removed. Nodes are colored according to the protein abundance (log ratio) compared to the cells before NGF treatment. The confidence score of each interaction is mapped to the edge thickness and opacity.

Cytoscape and its hundreds of apps provide numerous ways for users to interact with, visualize and analyze STRING networks (49), including integrating additional data from public repositories or their own experiments, changing visual styles and applying algorithms for network layout, clustering (50), enrichment analysis (51,52) and network analysis (53). In addition to these, the STRING App allows users to modify an already retrieved network in three different ways. First, the confidence cutoff for the imported evidence channels can be increased or decreased, which in the latter case involves fetching additional interactions from STRING. Second, users can expand the network by a user-specified number of interactors that are most closely associated with all network nodes or a selected subset of them. Third, any number of additional nodes can be queried by name and added to the existing network. Furthermore, the App provides a results panel with links to related databases such as UniProt (22), GeneCards (54), Pharos, COMPARTMENTS, TISSUES and DISEASES.
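The pruning applied in the Figure 2 example (keep interactions at or above a confidence cutoff, then drop proteins left without any interaction) can be sketched in Python as follows. This is an illustration on plain lists, not the App's own code; the function name and the toy data are hypothetical.

```python
def prune_network(nodes, edges, cutoff=0.4):
    """Filter edges by confidence and remove nodes that end up isolated.

    nodes: list of protein names
    edges: list of (protein_a, protein_b, confidence) tuples
    """
    kept_edges = [(a, b, score) for a, b, score in edges if score >= cutoff]
    connected = {name for a, b, _ in kept_edges for name in (a, b)}
    kept_nodes = [name for name in nodes if name in connected]
    return kept_nodes, kept_edges


if __name__ == "__main__":
    nodes = ["A", "B", "C", "D"]
    edges = [("A", "B", 0.9), ("B", "C", 0.45), ("C", "D", 0.2)]
    kept_nodes, kept_edges = prune_network(nodes, edges, cutoff=0.4)
    print(kept_nodes)  # D has no edge above the cutoff and is dropped
    print(len(kept_edges))
```

Raising the cutoff only needs the edges already retrieved, whereas lowering it requires fetching additional interactions, as noted above.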

OUTLOOK

The availability of completely sequenced genomes, and of protein–protein interaction data, continues to grow quickly. Hence, the data importing and processing for STRING will be further streamlined in order to accommodate this. The upcoming version 11 of STRING will cover more than 4000 organisms, and will contain pre-computed protein networks for all of them. We are also developing a separate and distinctive interface specifically for the investigation of virus-host protein–protein interactions, which will incorporate many of the evidence channels present in STRING. This specialized database will enable querying for a whole virus or for specific viral proteins and will superimpose the viral interaction network onto that of the host. Furthermore, we plan to extend the analysis options for user-provided gene set input, addressing a frequently expressed user need. This will include the possibility to report statistical enrichments for ranked gene lists, even genome-wide rankings. Together with the up-to-date network information, this will allow users to extract the maximum functional information from their queries, for any organism of interest.

55 in total

1. The identification of functional modules from the genomic association of genes.

Authors: Berend Snel; Peer Bork; Martijn A Huynen
Journal: Proc Natl Acad Sci U S A Date: 2002-04-30 Impact factor: 11.205

2. Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal: Genome Res Date: 2003-11 Impact factor: 9.043

3. Network-assisted analysis to prioritize GWAS results: principles, methods and perspectives.

Authors: Peilin Jia; Zhongming Zhao
Journal: Hum Genet Date: 2014-02 Impact factor: 4.132

4. Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions.

Authors: A J Enright; C A Ouzounis
Journal: Genome Biol Date: 2001 Impact factor: 13.583

5. The BioGRID interaction database: 2015 update.

Authors: Andrew Chatr-Aryamontri; Bobby-Joe Breitkreutz; Rose Oughtred; Lorrie Boucher; Sven Heinicke; Daici Chen; Chris Stark; Ashton Breitkreutz; Nadine Kolas; Lara O'Donnell; Teresa Reguly; Julie Nixon; Lindsay Ramage; Andrew Winter; Adnane Sellam; Christie Chang; Jodi Hirschman; Chandra Theesfeld; Jennifer Rust; Michael S Livstone; Kara Dolinski; Mike Tyers
Journal: Nucleic Acids Res Date: 2014-11-26 Impact factor: 19.160

6. UniProt: a hub for protein information.

Authors: UniProt Consortium
Journal: Nucleic Acids Res Date: 2014-10-27 Impact factor: 16.971

7. STRING v9.1: protein-protein interaction networks, with increased coverage and integration.

Authors: Andrea Franceschini; Damian Szklarczyk; Sune Frankild; Michael Kuhn; Milan Simonovic; Alexander Roth; Jianyi Lin; Pablo Minguez; Peer Bork; Christian von Mering; Lars J Jensen
Journal: Nucleic Acids Res Date: 2012-11-29 Impact factor: 16.971

8. Biological network analysis with CentiScaPe: centralities and experimental dataset integration.

Authors: Giovanni Scardoni; Gabriele Tosadori; Mohammed Faizan; Fausto Spoto; Franco Fabbri; Carlo Laudanna
Journal: F1000Res Date: 2014-07-01

9. Functional association networks as priors for gene regulatory network inference.

Authors: Matthew E Studham; Andreas Tjärnberg; Torbjörn E M Nordling; Sven Nelander; Erik L L Sonnhammer
Journal: Bioinformatics Date: 2014-06-15 Impact factor: 6.937

10. Network-based in silico drug efficacy screening.

Authors: Emre Guney; Jörg Menche; Marc Vidal; Albert-László Barabási
Journal: Nat Commun Date: 2016-02-01 Impact factor: 14.919
