Christian Blaschke, Eduardo Andres Leon, Martin Krallinger, Alfonso Valencia.
Abstract
BACKGROUND: Molecular biology has accumulated substantial amounts of data concerning the functions of genes and proteins. Functional descriptions are generally extracted manually from text and stored in biological databases to build up annotations for large collections of gene products. These annotation databases are crucial for interpreting large-scale analyses that use bioinformatics or experimental techniques. Because functional descriptions are accumulating rapidly in the biomedical literature, text mining tools that facilitate the extraction of such annotations are urgently needed. To make text mining tools usable in real-world scenarios, for instance to assist database curators during the annotation of protein function, comparisons and evaluations of different approaches on full-text articles are needed.
Year: 2005 PMID: 15960828 PMCID: PMC1869008 DOI: 10.1186/1471-2105-6-S1-S16
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Task 2 dataset description in numbers. The table shows the basic numbers referring to the task 2 training and test datasets. The full text articles of the training set were from the Journal of Biological Chemistry (JBC), Nature Medicine, Nature Genetics and Oncogene, while the test set articles were all from JBC.
| Data set description | Training set | Test set 2.1 | Test set 2.2 | Data Type |
| Full text articles | 803 | 113 | 99 | free text |
| Total number of GO annotations | 2317 | 1076 | 1227 | annotations |
| Number of proteins in the GO annotations | 939 | 138 | 138 | proteins |
| Number of GO terms used in the GO annotations | 776 | 580 | 544 | GO terms |
| Average number of annotations per protein | 2.467 | 7.797 | 8.891 | annotations |
| Annotations of Molecular Function GO terms | 709 | 330 | 356 | annotations |
| Annotations of Biological Process GO terms | 1061 | 544 | 701 | annotations |
| Annotations of Cellular Component GO terms | 547 | 182 | 170 | annotations |
| Molecular Function terms in the annotations | 343 | 173 | 179 | GO terms |
| Biological Process terms in the annotations | 339 | 334 | 314 | GO terms |
| Cellular Component terms in the annotations | 94 | 57 | 51 | GO terms |
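The "Average number of annotations per protein" row can be reproduced from two other rows of the table: the total number of GO annotations divided by the number of annotated proteins. The following is a minimal sketch of that arithmetic; the dictionary keys and the function name are illustrative and do not come from the original article. The training-set value computed this way agrees with the table only to within 0.001, which suggests the published figure may have been truncated rather than rounded.

```python
# Totals copied from the dataset description table
# (training set, test set 2.1, test set 2.2). Key names are illustrative.
total_annotations = {"training": 2317, "test_2.1": 1076, "test_2.2": 1227}
annotated_proteins = {"training": 939, "test_2.1": 138, "test_2.2": 138}


def average_annotations_per_protein(annotations: int, proteins: int) -> float:
    """Mean number of GO annotations per annotated protein."""
    if proteins <= 0:
        raise ValueError("proteins must be positive")
    return annotations / proteins


# One average per data set, keyed like the input dictionaries.
averages = {
    name: average_annotations_per_protein(
        total_annotations[name], annotated_proteins[name]
    )
    for name in total_annotations
}

for name, value in averages.items():
    print(f"{name}: {value:.3f}")
```

The test-set proteins carry roughly three times as many annotations each as the training-set proteins, which is visible directly in these ratios.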
Main features used by the participating teams. The table lists the characteristics, resources and methods adopted by the participants; the Users column gives the reference numbers of the teams that used each one.
| Characteristics (C), resources (R) and methods (M) | Users |
| (C) Sentence level (retrieval unit) | [19,20,22,25,26] |
| (C) Paragraph level (retrieval unit) | [21,23,24] |
| (C) Full article processed | [19,21,22,24,25] |
| (C) Full article processed except methods section | [26] |
| (C) Only abstract processed | [20] |
| (C) GO term – Protein distance | [22,24,25] |
| (M) Stemming | [20,22,24,26] |
| (M) POS tagging | [25,26] |
| (M) Shallow parsing | [25] |
| (M) Finite state automata | [20,25] |
| (M) Edit distance ranking | [20] |
| (M) Vector space model | [20,21] |
| (M) Machine learning technique | [23-25] |
| (M) Support Vector Machines | [23] |
| (M) Naïve Bayes models | [24,25] |
| (M) N-gram models | [24] |
| (M) External resource – tool: GATE NLP tool | [21] |
| (M) External resource – tool: Morphological normalizer BioMorpher | [21] |
| (M) External resource – tool: qtile query based ranking tool | [26] |
| (M) External resource – tool: Grok POS tagger | [25] |
| (M) Heuristic rules | [22,24-26] |
| (M) Regular expressions/pattern matching | [19,20,22,24,25] |
| (M) Literal string matching | [22,24] |
| (R) Protein name aliases (link to external databases) | [22,24,26] |
| (R) GO terms used | [19-26] |
| (R) GOA data used | [22-24] |
| (R) GO term forming words/tokens | [19,22,24,26] |
| (R) GO term variants | [22,25] |
| (R) External resource – data: Dictionary of suffixes | [24] |
| (R) External resource – data: UMLS/MeSH dictionary | [20,24] |
| (R) External resource – data: HUGO database | [22,24,26] |
| (R) External resource – data: SGD database | [24] |
| (R) External resource – data: MGI database | [24] |
| (R) External resource – data: RGD database | [24] |
| (R) External resource – data: TAIR database | [24] |
| (R) External resource – data: Procter and Gamble protein synonyms | [21] |
Task 2 participants. The table shows the overview of the participants undertaking task 2.
| Participant | Methods | Task 2.1 | Task 2.2 |
| Ehrler | Sequentially applied finite state automata | Yes | Yes |
| Couto | Information content of terms | Yes | Yes |
| Krymolowski | Heuristic rules, query expansion and question answering system | Yes | No |
| Verspoor | Word proximity networks approach | Yes | Yes |
| Krallinger | Heuristic weight and sentence sliding window | Yes | No |
| Rice | Term-based SVM approach | Yes | Yes |
| Ray | Statistical learning/Naïve Bayes method | Yes | Yes |
| Chiang | Hybrid method: pattern matching and sentence classification | Yes | Yes |
Task 2.1 results. The table shows the results of task 2.1 for each participant.
| Participant | Run | Evaluated results | Perfect prediction | Correct protein/general GO |
| Ehrler | 1 | 1048 | 268 (25.57%) | 74 (7.06%) |
| Krymolowski | 1 | 1053 | 166 (15.76%) | 77 (7.31%) |
| Krymolowski | 2 | 1050 | 166 (15.81%) | 90 (8.57%) |
| Krymolowski | 3 | 1050 | 154 (14.67%) | 86 (8.19%) |
| Verspoor | 1 | 1057 | 272 (25.73%) | 154 (14.57%) |
| Verspoor | 2 | 1864 | 43 (2.31%) | 40 (2.15%) |
| Verspoor | 3 | 1703 | 66 (3.88%) | 40 (2.35%) |
| Chiang I | 1 | 251 | 125 (49.80%) | 13 (5.18%) |
| Chiang I | 2 | 70 | 33 (47.14%) | 5 (7.14%) |
| Chiang I | 3 | 89 | 41 (46.07%) | 7 (7.87%) |
| Chiang II | 1 | 45 | 36 (80.00%) | 3 (6.67%) |
| Chiang II | 2 | 59 | 45 (76.27%) | 2 (3.39%) |
| Chiang II | 3 | 64 | 50 (78.12%) | 4 (6.25%) |
| Krallinger | 1 | 1050 | 303 (28.86%) | 69 (6.57%) |
| Rice | 1 | 524 | 59 (11.26%) | 28 (5.34%) |
| Rice | 2 | 998 | 125 (12.53%) | 69 (6.91%) |
| Ray | 1 | 413 | 83 (20.10%) | 19 (4.60%) |
| Ray | 2 | 458 | 7 (1.53%) | 0 (0.00%) |
| Couto | 1 | 1048 | 301 (28.72%) | 57 (5.44%) |
| Couto | 2 | 1048 | 280 (26.72%) | 60 (5.73%) |
| Couto | 3 | 1050 | 239 (22.76%) | 59 (5.62%) |
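The percentages in the results table appear to be the count in each column divided by the number of evaluated results for that run. A minimal sketch of that calculation follows, assuming two-decimal rounding; the function name and the tuple layout are illustrative, and only a few rows are copied from the table.

```python
def percentage(count: int, evaluated: int) -> float:
    """Share of evaluated predictions, in percent, rounded to two decimals."""
    if evaluated <= 0:
        raise ValueError("evaluated must be positive")
    return round(100.0 * count / evaluated, 2)


# (participant, run, evaluated, perfect, correct protein/general GO)
# Rows copied from the task 2.1 results table.
rows = [
    ("Ehrler", 1, 1048, 268, 74),
    ("Chiang II", 1, 45, 36, 3),
    ("Krallinger", 1, 1050, 303, 69),
]

for participant, run, evaluated, perfect, general in rows:
    print(
        f"{participant} run {run}: "
        f"perfect {percentage(perfect, evaluated)}%, "
        f"general {percentage(general, evaluated)}%"
    )
```

Because the denominator is the number of evaluated predictions, a run with few submissions (such as Chiang II, run 1) can reach a high percentage with a small absolute number of correct predictions.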
Figure 1. Task 2.1 precision versus total true positives (TP) plot. Task 2.1 results: precision vs. total number of true positives (protein and GO term predicted correctly, i.e. evaluated as 'high'). Each point represents a single run submitted by the participants of task 2.1. 1: Chiang et al., 2: Couto et al., 3: Ehrler et al., 4: Krallinger et al., 5: Krymolowski et al., 6: Ray et al., 7: Rice et al., 8: Verspoor et al.
Figure 2. Task 2.1 true positive predictions compared to GO term length. This plot shows the association between the number of true positives (TP), i.e. predictions in which both the GO term and the corresponding protein were correctly predicted (evaluated as 'high'), and the length of GO terms in task 2.1. GO term length was measured as the number of words forming the term after splitting at spaces and certain special characters (e.g. '-', '/' and ',').
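The GO term length measure described in this caption can be sketched as a simple tokenisation. The caption names '-', '/' and ',' only as examples, so the delimiter set below is an assumption, as are the function name and the sample term strings.

```python
import re

# Assumed delimiter set: whitespace plus '-', '/' and ','.
# The caption lists these characters only as examples, so the set
# actually used for the figure may have been larger.
_DELIMITERS = re.compile(r"[\s\-/,]+")


def go_term_length(term: str) -> int:
    """Number of words in a GO term after splitting at the assumed delimiters."""
    tokens = [token for token in _DELIMITERS.split(term) if token]
    return len(tokens)


# Illustrative term strings, not taken from the evaluated data.
for term in ("nucleus", "DNA binding", "protein-tyrosine kinase activity"):
    print(term, go_term_length(term))
```

Consecutive delimiters (for example a comma followed by a space) are treated as a single split point, and empty tokens are discarded, so punctuation does not inflate the word count.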
Figure 3. Task 2.1 evaluation of the annotation predictions depending on the GO categories. Evaluation of task 2.1 predictions by evaluation type and GO category. H: high (correct prediction), G: generally (overall correct but too general to be of practical use) and L: low (basically wrong predictions). The GO categories are CC: Cellular Component, MF: Molecular Function and BP: Biological Process. Note that only the entirely evaluated predictions are displayed.
Figure 4. Task 2.2 precision versus total true positives plot. Task 2.2 results: precision vs. total number of true positives. Each point represents a single run submitted by the participants of task 2.2. 1: Chiang et al., 2: Couto et al., 3: Ehrler et al., 4: Ray et al., 5: Rice et al., 6: Verspoor et al., *: the remaining runs, refer to table 5.
Task 2.2 results. The table shows the results of task 2.2 for each participant.
| Participant | Run | Evaluated results | Perfect prediction | Correct protein/general GO |
| Ehrler | 1 | 634 | 78 (12.30%) | 49 (7.73%) |
| Verspoor | 1 | 110 | 1 (0.91%) | 1 (0.91%) |
| Verspoor | 2 | 344 | 19 (5.52%) | 9 (2.62%) |
| Verspoor | 3 | 229 | 2 (0.87%) | 10 (4.37%) |
| Chiang I | 1 | 26 | 9 (34.62%) | 3 (11.54%) |
| Chiang I | 2 | 41 | 14 (34.15%) | 1 (2.44%) |
| Chiang I | 3 | 41 | 14 (34.15%) | 1 (2.44%) |
| Chiang II | 1 | 113 | 35 (30.97%) | 8 (7.08%) |
| Chiang II | 2 | 85 | 24 (28.24%) | 6 (7.06%) |
| Chiang II | 3 | 113 | 37 (32.74%) | 11 (9.73%) |
| Rice | 1 | 479 | 3 (0.63%) | 8 (1.67%) |
| Rice | 2 | 460 | 16 (3.48%) | 26 (5.65%) |
| Ray | 1 | 244 | 52 (21.31%) | 23 (9.43%) |
| Ray | 2 | 38 | 1 (2.63%) | 0 (0.00%) |
| Ray | 3 | 90 | 1 (1.11%) | 1 (1.11%) |
| Couto | 1 | 617 | 20 (3.24%) | 30 (4.86%) |
| Couto | 2 | 661 | 38 (5.75%) | 26 (3.93%) |
| Couto | 3 | 651 | 58 (8.91%) | 27 (4.15%) |