Tunca Doğan, Alistair MacDougall, Rabie Saidi, Diego Poggioli, Alex Bateman, Claire O'Donovan, Maria J Martin.
Abstract
MOTIVATION: Similarity-based methods are widely used to infer the properties of genes and gene products that have little or no experimental annotation. New approaches that overcome the limitations of methods relying solely on sequence similarity are attracting increased attention. One of these approaches is to use the organization of the structural domains in proteins.
Year: 2016 PMID: 27153729 PMCID: PMC4965628 DOI: 10.1093/bioinformatics/btw114
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (A) Schematic representation of the method; (B) representation of a pairwise DA alignment between two proteins; (C) GO MF DAG; nodes: all terms (blue), predicted terms (red).
Fig. 2. Cross-validation results: (A) ROC and precision versus recall curves for a GO term class; (B) performance of the method as F-score and (C) as precision. (Color version of this figure is available at Bioinformatics online.)
Statistics of the DA generation on UniProtKB databases
| Database | UniProtKB/Swiss-Prot (v2014_11) | UniProtKB/TrEMBL (v2015_12) |
|---|---|---|
| No. of input protein entries: | 547 084 | 55 270 679 |
| No. of entries with InterPro domain hits: | 407 247 | 35 564 711 |
| No. of unique DAs generated: | 54 388 | 1 148 372 |
Fig. 3. Number of domains per protein versus performance in cross-validation. (Color version of this figure is available at Bioinformatics online.)
Statistics of DAAC application results on UniProtKB/TrEMBL and comparison to the current annotation in the database
| | Predictions (no. of proteins in brackets) | Ratio (% of proteins in brackets) |
|---|---|---|
| Total no. of: | 44 818 178 (12 172 114) | 100% (100%) |
| No. of new: | 10 020 251 (2 812 016) | 22% (23%) |
| No. of identical: | 6 607 303 (5 065 640) | 15% (42%) |
| No. of similar (total): | 20 755 459 (7 342 619) | 46% (60%) |
| No. of similar (specific): | 15 358 089 (5 877 438) | 34% (48%) |
| No. of similar (generic): | 4 966 612 (2 879 775) | 12% (24%) |
| No. of differential: | 7 435 165 (3 303 747) | 17% (27%) |
Coverage increase in UniProtKB/TrEMBL database: 8.0%.
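The ratio column can be reproduced from the counts by dividing each row by the total row. The snippet below is a minimal sketch of that arithmetic for a few rows; the dictionary keys and the `ratios` helper are illustrative names, not part of the original work.

```python
# Counts copied from the table above as (predictions, proteins).
# Keys and the helper name are hypothetical, chosen for illustration.
counts = {
    "total": (44_818_178, 12_172_114),
    "new": (10_020_251, 2_812_016),
    "identical": (6_607_303, 5_065_640),
    "similar (total)": (20_755_459, 7_342_619),
    "differential": (7_435_165, 3_303_747),
}


def ratios(table):
    """Return each row as rounded percentages of the 'total' row."""
    total_pred, total_prot = table["total"]
    return {
        name: (round(100 * pred / total_pred), round(100 * prot / total_prot))
        for name, (pred, prot) in table.items()
    }


for name, (pred_pct, prot_pct) in ratios(counts).items():
    print(f"{name}: {pred_pct}% ({prot_pct}%)")
```

The protein-level percentages add up to more than 100%, presumably because a single protein can receive predictions in more than one category.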
Two example cases where multiple domains are required for the defined protein function
| GO id | GO term name | Associated DAs | No. of training proteins | Association confidence (F-score) | No. of query annotated proteins |
|---|---|---|---|---|---|
| GO:0004653 | polypeptide N-acetylgalactosaminyltransferase activity | 1) GAP-IPR001173-IPR000772; 2) GAP-IPR001173-GAP-IPR000772 | 26 | 1.00 | 5740 |
| GO:0042813 | Wnt-activated receptor activity | 1) GAP-IPR020067-GAP-IPR017981; 2) IPR020067-GAP-IPR017981; 3) IPR020067-IPR008993 | 25 | 0.91 | 1298 |
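To make the DA strings in this table easier to read, the sketch below splits each DA into its tokens and lists the InterPro domains that every DA associated with a GO term has in common. It assumes that tokens are separated by hyphens and that `GAP` marks a stretch of sequence without a domain hit; the function names are hypothetical and this is not the authors' implementation.

```python
import re

# InterPro accessions as they appear in the table: "IPR" plus six digits.
IPR_PATTERN = re.compile(r"IPR\d{6}")


def tokenize_da(da):
    """Split a DA string such as 'GAP-IPR001173-IPR000772' into tokens."""
    return [token for token in da.split("-") if token]


def domains(da):
    """Keep only the InterPro accessions, dropping GAP tokens (assumed fillers)."""
    return [token for token in tokenize_da(da) if IPR_PATTERN.fullmatch(token)]


def shared_domains(das):
    """Return the domains present in every DA of the list."""
    return set.intersection(*(set(domains(da)) for da in das))


# DAs copied from the two rows of the table above.
go_0004653 = ["GAP-IPR001173-IPR000772", "GAP-IPR001173-GAP-IPR000772"]
go_0042813 = [
    "GAP-IPR020067-GAP-IPR017981",
    "IPR020067-GAP-IPR017981",
    "IPR020067-IPR008993",
]

print(sorted(shared_domains(go_0004653)))
print(sorted(shared_domains(go_0042813)))
```

For GO:0004653 both listed DAs contain the same two domains, while for GO:0042813 only IPR020067 appears in all three DAs and is paired with a different second domain in the third.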
Statistics and performance comparison between InterPro2GO and DAAC
| | InterPro2GO | DAAC |
|---|---|---|
| Total no. of mappings | 6382 | 25626 |
| No. of unique entries | 2927 | 8248 |
| No. of unique GO terms | 1411 | 778 |
| No. of GO terms predicted by each system | 1188 | 555 |
| (No. of shared terms: 223) | | |
| No. of mapped GO term relations with the other system | 760 in relation; 651 independent | 625 in relation; 153 independent |
| % specificity of the mapped GO terms compared to other system | 19% | 75% |
| (6% the same term) | | |
| F-score | 0.675 | 0.874 |
| Recall | 0.615 | 0.843 |
| Precision | 0.909 | 0.919 |
| FPR (fall-out) | 1.98 × 10⁻⁵ | 4.57 × 10⁻⁴ |
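The GO term counts in this table are mutually consistent under one reading, sketched below: the row "No. of GO terms predicted by each system" is taken to count terms predicted by that system only, so that adding the 223 shared terms gives the number of unique GO terms. That reading is an assumption supported by the arithmetic, not a statement from the source; the field names are illustrative.

```python
# Values copied from the table above; field names are hypothetical.
SHARED_TERMS = 223

systems = {
    "InterPro2GO": {"unique": 1411, "exclusive": 1188, "in_relation": 760, "independent": 651},
    "DAAC": {"unique": 778, "exclusive": 555, "in_relation": 625, "independent": 153},
}


def is_consistent(stats, shared):
    """Check the two ways the unique GO term count is partitioned."""
    by_overlap = stats["exclusive"] + shared == stats["unique"]
    by_relation = stats["in_relation"] + stats["independent"] == stats["unique"]
    return by_overlap and by_relation


for name, stats in systems.items():
    print(name, is_consistent(stats, SHARED_TERMS))
```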