Samira Jaeger, Sylvain Gaudan, Ulf Leser, Dietrich Rebholz-Schuhmann.
Abstract
BACKGROUND: Functional annotation of proteins remains a challenging task. Currently, the scientific literature serves as the main source of functional annotations that have not yet been curated, but curation work is slow and expensive. Automatic techniques that support this work still lack reliability. We developed a method to identify conserved protein interaction graphs and to predict missing protein functions from orthologs in these graphs. To enhance the precision of the results, we furthermore implemented a procedure that validates all predictions based on findings reported in the literature.
Year: 2008 PMID: 18673526 PMCID: PMC2500093 DOI: 10.1186/1471-2105-9-S8-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Example CCS detected between H. sapiens and S. cerevisiae. The figure shows a conserved and connected subgraph (CCS) between H. sapiens (circle) and S. cerevisiae (hexagon). Proteins of both species are involved in mRNA splicing and are known to exhibit splicing factor activity to bind the mRNA and support the splicing process. (Solid lines represent conserved PPIs within a species and dashed lines indicate orthology relationships between proteins.)
Figure 2. Schematic illustration for comparing GO annotations in this study. The flowchart summarizes all four combinations for comparing GO annotations of different resources. Protein – annotation associations were extracted from text with or without species identification, and GO annotations from text and the UniProtKB/Swiss-Prot database were compared based on exact versus relative matching.
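The difference between exact and relative matching can be illustrated with a small sketch. The record does not define relative matching, so the definition used here is an assumption: a term extracted from text is accepted if it is identical to the database annotation or if one term is an ancestor of the other in the ontology. The hierarchy below is a simplified toy excerpt, and the names `PARENTS`, `ancestors`, `exact_match`, and `relative_match` are hypothetical.

```python
# Toy is-a hierarchy (child -> parents). Simplified and hypothetical,
# not the real GO graph.
PARENTS = {
    "mRNA binding": {"RNA binding"},
    "RNA binding": {"nucleic acid binding"},
    "DNA binding": {"nucleic acid binding"},
    "nucleic acid binding": {"binding"},
    "binding": set(),
}


def ancestors(term):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = list(PARENTS.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, ()))
    return seen


def exact_match(text_term, db_term):
    """Exact matching: the two terms must be identical."""
    return text_term == db_term


def relative_match(text_term, db_term):
    """Assumed definition: identical, or one term is an ancestor of the other."""
    return (
        text_term == db_term
        or text_term in ancestors(db_term)
        or db_term in ancestors(text_term)
    )
```

Under this assumed definition, "RNA binding" found in text would confirm a database annotation of "mRNA binding" under relative matching but not under exact matching, while sibling terms such as "DNA binding" and "RNA binding" would match under neither.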
Figure 3. Schematic overview of the studied protein annotations. Schematic overview of the different sets of protein – annotation associations considered in this study.
Evidence for protein – GO annotation associations in text for Set 1. Evidence in the literature for annotations of randomly chosen orthologous proteins (Set 1), compared using relative matching.
| Species | Recall |
| DM | 470/2046 (23.0%) |
| MM | 859/3141 (27.3%) |
| SC | 2747/4974 (55.2%) |
| HS | 2801/5419 (51.7%) |
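The recall values in this table are the fraction of annotations for which the literature search found evidence. A minimal sketch of that arithmetic follows, using the counts copied from the table; the helper name `recall` and the one-decimal rounding are assumptions made for illustration.

```python
def recall(found: int, total: int) -> float:
    """Return recall as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * found / total, 1)


# Counts copied from the Set 1 table: (confirmed in text, annotations considered).
set1_counts = {
    "DM": (470, 2046),
    "MM": (859, 3141),
    "SC": (2747, 4974),
    "HS": (2801, 5419),
}

set1_recall = {
    species: recall(found, total)
    for species, (found, total) in set1_counts.items()
}

for species, value in set1_recall.items():
    print(f"{species}: {value}%")
```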
Evidence for protein – GO annotation associations in text for Set 2. Evidence in the literature for annotations from UniProtKB/Swiss-Prot, considering only proteins of structurally conserved subgraphs (Set 2), compared using relative matching.
| PPI Comparison | Recall |
| DM-SC | 34/78 (43.6%) |
| HS-DM | 149/427 (35.0%) |
| HS-SC | 1002/1796 (56.0%) |
| HS-MM | 3083/6119 (49.6%) |
Evidence for protein – GO annotation associations in text for Set 3. Comparison of newly predicted GO terms (Set 3a) and known GO terms (Set 3b) from UniProtKB/Swiss-Prot with protein – GO annotation associations in Medline, using different extraction criteria.
| Extraction criteria | GO term set | Recall |
| Exact & Species | predicted GO terms | 19/88 (22%) |
| Exact & Species | known GO terms | 129/283 (46%) |
| Relative & Species | predicted GO terms | 21/88 (24%) |
| Relative & Species | known GO terms | 164/283 (58%) |
| Exact | predicted GO terms | 31/88 (35%) |
| Exact | known GO terms | 201/283 (71%) |
| Relative | predicted GO terms | 34/88 (39%) |
| Relative | known GO terms | 234/283 (82%) |
Redundancy of protein – GO term associations in Medline. Median, maximum and mean frequencies of protein – GO term associations in Medline for proteins of Sets 2 and 3.
| Protein Set | Frequencies | Total | MF | BP | CC |
| Set 2 | median | 11 | 15 | 10 | 7 |
| Set 2 | max. | 19855 | 2990 | 19855 | 3132 |
| Set 2 | mean | 72 | 77 | 89 | 25 |
| Set 3a (predicted) | median | 9 | 12 | 18 | 3 |
| Set 3a (predicted) | max. | 907 | 907 | 88 | 66 |
| Set 3a (predicted) | mean | 47 | 55 | 40 | 3 |
| Set 3b (known) | median | 33 | 51 | 18 | 27 |
| Set 3b (known) | max. | 6566 | 6566 | 2053 | 1823 |
| Set 3b (known) | mean | 199 | 328 | 153 | 87 |
Distribution of confirmed GO terms across the three subontologies of GO. Subontology specific consideration of known GO terms (Set 3b) confirmed by literature.
| Extraction criteria | Recall – MF | Recall – BP | Recall – CC |
| Exact & Species | 56/107 (52%) | 31/85 (36%) | 42/91 (46%) |
| Relative & Species | 71/107 (66%) | 41/85 (48%) | 52/91 (57%) |
| Exact | 83/107 (77%) | 51/85 (60%) | 67/91 (73%) |
| Relative | 90/107 (84%) | 69/85 (81%) | 75/91 (82%) |
Distribution of confirmed GO terms by significance and evidence. Distribution of the identified terms over the list specified by the GoTagger separated into predicted (3a) and known annotations (3b).
| | predicted GO terms | | | | | known GO terms | | | | |
| Extraction criteria | # terms | 1–5 | 6–10 | 11–20 | >20 | # terms | 1–5 | 6–10 | 11–20 | >20 |
| Exact & Species | 19 | 5 | 6 | 3 | 5 | 129 | 60 | 35 | 16 | 18 |
| Relative & Species | 21 | 10 | 8 | 2 | 1 | 164 | 82 | 37 | 20 | 25 |
| Exact | 31 | 18 | 7 | 3 | 3 | 201 | 125 | 39 | 22 | 15 |
| Relative | 34 | 25 | 5 | 3 | 1 | 234 | 163 | 36 | 17 | 18 |
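Each row of this table splits the confirmed terms into four bins (1–5, 6–10, 11–20, >20), so the bins should add up to the "# terms" column. The sketch below checks that consistency and also reports the share of confirmed terms that fall into the first bin. The counts are copied from the table; the data layout and the names `rows`, `bins_consistent`, and `top_bin_share` are hypothetical choices for illustration.

```python
# (total confirmed terms, [bin 1-5, bin 6-10, bin 11-20, bin >20]) per term set.
rows = {
    "Exact & Species": {"predicted": (19, [5, 6, 3, 5]), "known": (129, [60, 35, 16, 18])},
    "Relative & Species": {"predicted": (21, [10, 8, 2, 1]), "known": (164, [82, 37, 20, 25])},
    "Exact": {"predicted": (31, [18, 7, 3, 3]), "known": (201, [125, 39, 22, 15])},
    "Relative": {"predicted": (34, [25, 5, 3, 1]), "known": (234, [163, 36, 17, 18])},
}


def bins_consistent(total, bins):
    """True when the bin counts add up to the reported number of terms."""
    return sum(bins) == total


def top_bin_share(total, bins):
    """Fraction of confirmed terms that fall into the first (1-5) bin."""
    return bins[0] / total


all_consistent = all(
    bins_consistent(total, bins)
    for criteria in rows.values()
    for total, bins in criteria.values()
)

for name, criteria in rows.items():
    for term_set, (total, bins) in criteria.items():
        share = top_bin_share(total, bins)
        print(f"{name:20s} {term_set:10s} bin 1-5 share: {share:.0%}")
```

The same check applies to any table that reports a total alongside a breakdown, and it is a cheap way to catch transcription errors when tables are copied by hand.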