Frédéric Ehrler, Antoine Geissbühler, Antonio Jimeno, Patrick Ruch.
Abstract
BACKGROUND: In the context of the BioCreative competition, where training data were very sparse, we investigated two complementary tasks: 1) given a Swiss-Prot triplet, containing a protein, a GO (Gene Ontology) term and a relevant article, extraction of a short passage that justifies the GO category assignment; 2) given a Swiss-Prot pair, containing a protein and a relevant article, automatic assignment of a set of categories.
Year: 2005 PMID: 15960836 PMCID: PMC1869016 DOI: 10.1186/1471-2105-6-S1-S23
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
GO terms per record in DSI.
| # GO terms | # Swiss-Prot records | Proportion (%) | Total (%) |
| 2 | 155 | 25.3 | 25.3 |
| 3 | 147 | 24.0 | 49.4 |
| 1 | 146 | 23.8 | 73.3 |
| 4 | 74 | 12.1 | 85.4 |
| 5 | 32 | 5.23 | 90.6 |
| 6 | 22 | 3.60 | 94.2 |
| 7 | 13 | 2.12 | 96.3 |
| 8 | 7 | 1.14 | 97.5 |
| 9 | 5 | 0.81 | 98.3 |
| 12 | 3 | 0.49 | 98.8 |
| 10 | 3 | 0.49 | 99.3 |
| 33 | 1 | 0.16 | 99.5 |
| 11 | 1 | 0.16 | 99.6 |
| 14 | 1 | 0.16 | 99.8 |
| 15 | 1 | 0.16 | 100 |
Example of distances for task 2.1.
| Sentence | Direct match | Smith-Waterman | Levenshtein | Jaccard | Jaro | FinalScore |
| S1 | 1 | 19 | -45 | 0.062 | 0.62 | 29 |
| S2 | 2 | 51 | -18 | 0.12 | 0.58 | 71 |
| S3 | 2 | 18 | -12 | 0.083 | 0.58 | 38 |
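The table above scores candidate sentences against a GO term with several string distances. As a minimal sketch of two of them, the Python below computes a Levenshtein edit distance and a token-level Jaccard coefficient. The function names, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the source does not say how the sentences were tokenized, how the measures were scaled (the Levenshtein column is negative, which suggests the distance was negated to act as a similarity), or how they were combined into FinalScore.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient over lowercased, whitespace-separated tokens
    (hypothetical tokenization, chosen only for this sketch)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    term = "signal transduction"
    sentence = "This receptor mediates signal transduction in T cells"
    print(jaccard(term, sentence))       # 2 shared tokens out of 8 distinct
    print(-levenshtein(term, sentence))  # negated so that higher is better
```

Token-level Jaccard rewards sentences that share whole words with the GO term, while the character-level edit distance penalizes long sentences heavily, which is one reason several measures are usually combined rather than used alone.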
Distribution of tokens per term in the Gene Ontology.
| # tokens | # GO terms | Proportion (%) | Total (%) |
| 1 | 391 | 2.34 | 2.34 |
| 2 | 4046 | 24.2 | 26.5 |
| 3 | 6263 | 37.5 | 64.1 |
| 4 | 2723 | 16.3 | 80.4 |
| 5 | 1563 | 9.36 | 89.8 |
| 6 | 833 | 4.99 | 94.8 |
| 7 | 395 | 2.36 | 97.1 |
| 8 | 204 | 1.22 | 98.3 |
| 9 | 97 | 0.58 | 98.9 |
| 10 | 42 | 0.25 | 99.2 |
| 11 | 31 | 0.18 | 99.4 |
| 12 | 38 | 0.22 | 99.6 |
| 13 | 16 | 0.09 | 99.7 |
| 14 | 12 | 0.07 | 99.8 |
| 15 | 11 | 0.06 | 99.8 |
| 16 | 5 | 0.02 | 99.9 |
| 17 | 2 | 0.01 | 99.9 |
| 19 | 1 | 0.00 | 99.9 |
| 22 | 1 | 0.00 | 99.9 |
| 24 | 2 | 0.01 | 99.9 |
| 25 | 4 | 0.02 | 99.9 |
| 26 | 2 | 0.01 | 99.9 |
| 27 | 1 | 0.00 | 99.9 |
| 28 | 1 | 0.00 | 99.9 |
Term Weights in the SMART System.
| Term Frequency | |
| First Letter | |
| n (natural) | $tf$ |
| l (logarithmic) | $1 + \log(tf)$ |
| a (augmented) | $\alpha + \beta \times (tf / \max(tf))$ |
| Inverse Document Frequency | |
| Second Letter | |
| n (no) | $1$ |
| t (full) | $\log(N / df)$ |
| Normalization | |
| Third Letter | |
| n (no) | $1$ |
| c (cosine) | $1 / \sqrt{\sum_i w_i^2}$ |
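To make the three-letter SMART codes concrete, here is a minimal Python sketch of the `ltc` and `lnn` schemes that appear in the run settings of this record. It assumes the standard SMART definitions: logarithmic term frequency 1 + log(tf), full inverse document frequency log(N/df), and cosine normalization. The function names, the natural logarithm, and the toy document frequencies are assumptions for illustration and are not taken from the authors' system.

```python
import math
from collections import Counter


def lnn_weights(tokens):
    """lnn: logarithmic tf, no idf, no normalization."""
    return {t: 1 + math.log(f) for t, f in Counter(tokens).items()}


def ltc_weights(tokens, doc_freq, n_docs):
    """ltc: logarithmic tf, full idf, cosine normalization.

    doc_freq maps a term to the number of documents containing it;
    terms missing from doc_freq are skipped (an assumption of this sketch).
    """
    raw = {}
    for term, freq in Counter(tokens).items():
        df = doc_freq.get(term, 0)
        if df > 0:
            raw[term] = (1 + math.log(freq)) * math.log(n_docs / df)
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:
        return {t: 0.0 for t in raw}
    return {t: w / norm for t, w in raw.items()}


if __name__ == "__main__":
    doc = ["protein", "binding", "protein"]
    df = {"protein": 2, "binding": 1}  # hypothetical document frequencies
    print(ltc_weights(doc, df, n_docs=4))
    print(lnn_weights(["protein", "binding"]))
```

With cosine normalization the weight vector has unit length, so a dot product between an `ltc` vector and an `lnn` vector is not biased toward longer texts on the normalized side.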
Sample of GO synonyms for each axis.
| function: cholesterol O-acyltransferase – sterol O-acyltransferase activity |
| component: protoplasm – intracellular |
| process: cell division – cytokinesis |
Sample of GO definitions.
| term: TRAIL receptor 2 biosynthesis |
| goid: GO:0045559 |
| definition: The formation from simpler components of TRAIL-R2 (TNF-related apoptosis inducing ligand receptor 2), which engages a caspase-dependent apoptotic pathway and mediates apoptosis via the intracellular adaptor molecule FADD/MORT1. |
| term: trans-2-enoyl-CoA reductase (NADPH) activity |
| goid: GO:0019166 |
| definition: Catalysis of the reaction: acyl-CoA + NADP+ = trans-2,3-dehydroacyl-CoA + NADPH + H+. |
Distribution of the most frequent GO terms in the 640-item Swiss-Prot data set (DSI): cut-off at 14 occurrences.
| GO ID | # Occurrences | Proportion (%) | Total (%) | Term |
| GO:0005634 | 62 | 3.41 | 3.41 | nucleus |
| GO:0007165 | 58 | 3.19 | 6.60 | signal transduction |
| GO:0005737 | 50 | 2.75 | 9.36 | cytoplasm |
| GO:0005887 | 47 | 2.58 | 11.9 | integral to plasma membrane |
| GO:0005886 | 30 | 1.65 | 13.6 | plasma membrane |
| GO:0003700 | 27 | 1.48 | 15.0 | transcription factor activity |
| GO:0016021 | 27 | 1.48 | 16.5 | integral to membrane |
| GO:0005515 | 19 | 1.04 | 17.6 | protein binding |
| GO:0006412 | 16 | 0.88 | 18.5 | protein biosynthesis |
| GO:0006810 | 15 | 0.82 | 19.3 | transport |
| GO:0006468 | 14 | 0.77 | 20.0 | protein amino acid phosphorylation |
Passage retrieval: results for each GO axis.
| | biological process | cellular component | molecular function | Total |
| # submitted passages | 710 | 185 | 361 | 1256 |
| # evaluated passages | 330 | 126 | 205 | 661 |
| GO-high | 12 % | 20 % | 18 % | 15 % |
| GO-generally | 11 % | 09 % | 16 % | 12 % |
| GO-low | 74 % | 67 % | 64 % | 70 % |
| Prot-high | 59 % | 55 % | 49 % | 55 % |
| Prot-generally | 05 % | 05 % | 13 % | 08 % |
| Prot-low | 33 % | 36 % | 36 % | 35 % |
Figure 1. Task 2.2. Official results. Submitted runs are in gray (threshold = 0). White histograms show the performance of the passage retrieval tools when only highly reliable results are considered (threshold = 0.6). Black histograms show the performance of the retrieval tools using an intermediate confidence threshold (0.3).
Figure 2. Results of the GO annotation regarding the protein (Y axis) for different levels of confidence (X axis).
Figure 3. Results of the GO annotation (Y axis) for different levels of confidence (X axis).
Results for different system settings.
| Run | VS | RegEx | Thesaurus | NP | GO Definition + Prior | Full Article | MRP ( | MAP ( |
| r1 (baseline) | anc.atn | - | - | - | - | - | 12.08 | 5.97 |
| r2 | ltc.lnn | x | - | - | - | - | 15.86 (+31.3) | 7.11 (+19.1) |
| r3 | ltc.lnn | x | x | - | - | - | 16.10 (+33.2) | 7.34 (+22.5) |
| r4 (official) | ltc.lnn | x | x | x | - | - | | |
| r5 | anc.ltn | x | x | x | - | - | 16.89 (+39.8) | 8.02 (+34.3) |
| r6 | anc.ltn | x | x | x | x | - | | |
| r7 | ltc.lnn | x | x | x | - | x | 16.02 (+32.6) | 7.34 (+22.9) |
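The figures in parentheses in the last two columns appear to be relative gains over the r1 baseline, in percent. The short Python check below, with a hypothetical helper name `relative_gain`, reproduces several of them from the table values under that assumption; small discrepancies in other rows (for example r3) are presumably due to rounding of the underlying scores.

```python
def relative_gain(baseline: float, score: float) -> float:
    """Relative improvement of score over baseline, in percent."""
    return (score - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Baseline (r1) values taken from the table: 12.08 and 5.97.
    print(round(relative_gain(12.08, 15.86), 1))  # r2, first column: 31.3
    print(round(relative_gain(12.08, 16.89), 1))  # r5, first column: 39.8
    print(round(relative_gain(5.97, 7.11), 1))    # r2, second column: 19.1
    print(round(relative_gain(5.97, 8.02), 1))    # r5, second column: 34.3
```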