Andrea Franceschini, Damian Szklarczyk, Sune Frankild, Michael Kuhn, Milan Simonovic, Alexander Roth, Jianyi Lin, Pablo Minguez, Peer Bork, Christian von Mering, Lars J Jensen.
Abstract
Complete knowledge of all direct and indirect interactions between proteins in a given cell would represent an important milestone towards a comprehensive description of cellular mechanisms and functions. Although this goal is still elusive, considerable progress has been made, particularly for certain model organisms and functional systems. Currently, protein interactions and associations are annotated at various levels of detail in online resources, ranging from raw data repositories to highly formalized pathway databases. For many applications, a global view of all the available interaction data is desirable, including lower-quality data and/or computational predictions. The STRING database (http://string-db.org/) aims to provide such a global perspective for as many organisms as feasible. Known and predicted associations are scored and integrated, resulting in comprehensive protein networks covering >1100 organisms. Here, we describe the update to version 9.1 of STRING, introducing several improvements: (i) we extend the automated mining of scientific texts for interaction information, to now also include full-text articles; (ii) we entirely re-designed the algorithm for transferring interactions from one model organism to the other; and (iii) we provide users with statistical information on any functional enrichment observed in their networks.
Year: 2012 PMID: 23203871 PMCID: PMC3531103 DOI: 10.1093/nar/gks1094
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Protein–protein associations based on automated text mining
| | STRING v9.0 | STRING v9.1 | Fold increase |
|---|---|---|---|
| Natural language processing | 38 859 | 63 331 | 1.629 |
| Cooccurrence, high confidence | 286 880 | 792 730 | 2.763 |
| Cooccurrence, medium confidence | 1 100 756 | 1 672 222 | 1.519 |
| Cooccurrence, low confidence | 3 214 754 | 4 270 322 | 1.328 |
This table quantifies non-redundant associations extracted by text mining in STRING, at various confidence levels; note that both STRING versions shown here are based on the same set of organisms and proteins. The increase in text-mining interactions is largest in the high confidence bracket, reflecting the increased performance enabled by the extension to full text articles, and by the improved entity recognition engine.
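The fold-increase column can be recomputed from the two count columns as the v9.1 count divided by the v9.0 count. The following is a minimal sketch of that arithmetic; the names `counts` and `fold_increase` are illustrative and do not come from STRING.

```python
# Counts copied from the table above: (STRING v9.0, STRING v9.1).
counts = {
    "Natural language processing": (38_859, 63_331),
    "Cooccurrence, high confidence": (286_880, 792_730),
    "Cooccurrence, medium confidence": (1_100_756, 1_672_222),
    "Cooccurrence, low confidence": (3_214_754, 4_270_322),
}


def fold_increase(old: int, new: int) -> float:
    """Ratio of the newer count to the older count."""
    return new / old


folds = {name: fold_increase(old, new) for name, (old, new) in counts.items()}

for name, fold in folds.items():
    print(f"{name}: {fold:.3f}")

# The bracket with the largest relative growth.
largest = max(folds, key=folds.get)
print("Largest increase:", largest)
```

The last printed digit can differ from the table by one unit, depending on whether the ratio is rounded or truncated to three decimals.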
Figure 1. Improved procedure for interaction transfer between organisms. Left: steps 1 and 2 of the functional association transfer pipeline. In the first step, the individual links between proteins are combined into a score between orthologous groups, sequentially, from the strongest link (thick line) to the weakest (thin). Each subsequent score is down-weighted, based both on the similarity of its organism to organisms that have already contributed to the combined scores and on the number of proteins from the same organism inside the orthologous group. In the second step of the transfer pipeline, the links between orthologous groups are transferred back to individual protein pairs belonging to these groups. This is done sequentially from the lowest to the highest taxonomy level. In the above example, the two transferred links from the highest taxonomic level (orange links) are penalized for the increase in the number of proteins from the target species in one of the orthologous groups. Right: ROC curves indicating the performance of predicted interolog scores, benchmarked against KEGG pathways; an inferred link between two proteins is considered to be a true positive when both proteins are annotated together in at least one shared KEGG pathway.
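The true-positive criterion used for the ROC benchmark can be illustrated with a short sketch. It is not the STRING benchmarking code: the protein identifiers, pathway names, and scores below are invented toy data, and the functions `shares_pathway` and `roc_points` are hypothetical helpers. The sketch also assumes that every pair without a shared pathway counts as a negative and that no two links have tied scores.

```python
def shares_pathway(a, b, pathways):
    """True when both proteins are annotated to at least one common pathway."""
    return bool(pathways.get(a, set()) & pathways.get(b, set()))


def roc_points(scored_links, pathways):
    """Return (false positive rate, true positive rate) points.

    Links are visited from the highest to the lowest score, so each point
    corresponds to accepting all links down to that score.
    """
    labelled = sorted(
        ((score, shares_pathway(a, b, pathways)) for a, b, score in scored_links),
        key=lambda item: item[0],
        reverse=True,
    )
    positives = sum(1 for _, is_tp in labelled if is_tp)
    negatives = len(labelled) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, is_tp in labelled:
        if is_tp:
            tp += 1
        else:
            fp += 1
        fpr = fp / negatives if negatives else 0.0
        tpr = tp / positives if positives else 0.0
        points.append((fpr, tpr))
    return points


# Toy annotation (hypothetical): protein -> set of pathway identifiers.
pathways = {
    "P1": {"path_a"},
    "P2": {"path_a", "path_b"},
    "P3": {"path_b"},
    "P4": {"path_c"},
}

# Toy inferred links (hypothetical): (protein, protein, transferred score).
links = [("P1", "P2", 0.9), ("P2", "P3", 0.7), ("P1", "P4", 0.6), ("P3", "P4", 0.2)]

points = roc_points(links, pathways)
print(points)
```

A curve that rises steeply before moving right, as in this toy case where both pathway-sharing pairs receive the highest scores, indicates that high scores are concentrated on true positives.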
Figure 2. Network visualization and statistical analysis of a user-supplied protein list. The STRING screenshot shows a user-supplied set of genes, here a selection of cancer genes as annotated in the COSMIC database (52). The set is restricted to those genes that are known to predispose to cancer already when mutated in the germline, and that have at least one connection in STRING. The inset illustrates the website's new functionality for automatically detecting statistically enriched functions or processes in a network. In this example, one of the detected processes (nucleotide excision repair) is of interest and has been selected; STRING automatically highlights the corresponding nodes in the network, where they are seen to form a densely connected module.
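One common way to test whether a function or process is over-represented in a gene set is a one-sided hypergeometric test. The text here does not state which test STRING applies, so the following is only a sketch under that assumption; the function name `hypergeom_enrichment_p` and all numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes in a random sample.

    population: number of genes in the background
    annotated:  background genes carrying the annotation of interest
    sample:     size of the user-supplied gene set
    hits:       genes in the set that carry the annotation
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 000 background genes, 50 annotated to a process,
# and a 100-gene input set containing 8 of the annotated genes.
p_value = hypergeom_enrichment_p(population=20_000, annotated=50, sample=100, hits=8)
print(f"Enrichment p-value: {p_value:.3g}")
```

In practice many annotation terms are tested at once, so an analysis along these lines would typically also apply a multiple-testing correction before reporting enriched terms.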