Evelyn B Camon, Daniel G Barrell, Emily C Dimmer, Vivian Lee, Michele Magrane, John Maslen, David Binns, Rolf Apweiler.
Abstract
BACKGROUND: The Gene Ontology Annotation (GOA) database http://www.ebi.ac.uk/GOA aims to provide high-quality supplementary GO annotation to proteins in the UniProt Knowledgebase. Like many other biological databases, GOA gathers much of its content from the careful manual curation of literature. However, as both the volume of literature and of proteins requiring characterization increases, the manual processing capability can become overloaded. Consequently, semi-automated aids are often employed to expedite the curation process. Traditionally, electronic techniques in GOA depend largely on exploiting the knowledge in existing resources such as InterPro. However, in recent years, text mining has been hailed as a potentially useful tool to aid the curation process. To encourage the development of such tools, the GOA team at EBI agreed to take part in the functional annotation task of the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge. BioCreAtIvE task 2 was an experiment to test if automatically derived classification using information retrieval and extraction could assist expert biologists in the annotation of the GO vocabulary to the proteins in the UniProt Knowledgebase. GOA provided the training corpus of over 9000 manual GO annotations extracted from the literature. For the test set, we provided a corpus of 200 new Journal of Biological Chemistry articles used to annotate 286 human proteins with GO terms. A team of experts manually evaluated the results of 9 participating groups, each of which provided highlighted sentences to support their GO and protein annotation predictions. Here, we give a biological perspective on the evaluation, explain how we annotate GO using literature and offer some suggestions to improve the precision of future text-retrieval and extraction techniques. 
Finally, we provide the results of the first inter-annotator agreement study for manual GO curation, as well as an assessment of our current electronic GO annotation strategies.
Year: 2005 PMID: 15960829 PMCID: PMC1869009 DOI: 10.1186/1471-2105-6-S1-S17
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Important regions of a paper for GO annotation and the GO evidence codes that can typically be extracted from each region.
| Paper section | Typical GO evidence codes |
| Title/Abstract | Non-traceable author statement (NAS), Traceable author statement (TAS) |
| Introduction | Non-traceable author statement (NAS), Traceable author statement (TAS) |
| Results | All GO evidence codes [27] |
| Discussion | All GO evidence codes [27] |
| Figure Legend | All GO evidence codes [27] |
| Materials and Methods | Identify species (via cell line). Identify GO evidence code according to experiment used. |
| GO term | GO ID | Source | Evidence code | Reference | Protein |
| estrogen receptor activity | GO:0030284 | UniProt | TAS | PubMed: 11181953 | |
| estrogen receptor activity | GO:0030284 | UniProt | ISS | PubMed: 11181953 | ESR2_HUMAN Q92731 |
Figure 1. BioCreAtIvE Evaluation Tool (subtask 2.2), showing GO annotation of 'kinase activity' (GO:0016301) (right tool bar) by user 9-1 with supporting text evidence (central panel). The left tool bar shows the UniProt accession number; in this case Q8IWU2 has been annotated. Q8IWU2 represents a KPI-2 protein, so based on the evidence text the user was evaluated as 'high' for GO term prediction and 'high' for representing the correct gene product. The user also used this sentence to predict the GO term 'receptor signaling protein serine/threonine kinase activity' (GO:0004702). Although that GO annotation is correct for this protein, the evidence text supplied does not support that level of detail. The same evidence text was therefore evaluated as 'general' for the GO term prediction of GO:0004702 (same lineage as the correct GO term 'kinase activity') and 'high' for representing the correct gene product.
Evaluation criteria for GO and protein predictions.
| Rating | GO term prediction | Protein prediction |
| High | The GO term assignment was correct or close to what a curator would choose, given the evidence text. | The protein mentioned in the evidence text correctly represented the associated UniProt accession (correct species). |
| General | The GO term assignment was in the correct lineage, given the evidence text, but was too high level (a parent of the correct GO term). | The evidence text did not support annotation to the associated UniProt accession but was generally correct for the protein family or orthologs (non-human species). |
| Low | The evidence text did not support the GO term assignment. | The evidence text did not mention the correct protein (e.g. for the Rev7 protein (ligand), incorrect evidence text referred to the 'Rev7 receptor') or protein family. |
Summary of mistakes and curator comments following the task 2 evaluation.
| Mistake | Suggested improvement |
| Predicting obsolete GO terms | Strip obsolete GO terms, i.e. children of |
| Predicting GO terms from Materials and Methods, e.g. a 'pH' value yielded 'pH domain binding' (GO:0042731); 'CHO cell line' yielded numerous GO terms containing 'acetyl' | Only look in certain sections of the paper for features. See Table 1 for GOA. |
| Predicting plant GO terms for human proteins | Look at GO Documentation on |
| Highlighting too much text | Set a limit on the evidence text highlighted so it remains useful for curators; limit to <5 lines. |
| Over-predicting GO terms from one line of text | It is more important to a curator to choose a higher-level term that is correct than a term that is too specific and incorrect. |
| Common GO terms predicted out of context, e.g. the text 'mapped to chromosome 3q26' yielded the GO component term 'chromosome' (GO:0005694), although the text indicates a chromosome number, not where the protein functions; the text '249 amino acid' yielded multiple GO terms such as 'amino acid activation' (GO:0043038). | Most papers will mention the chromosome location and the amino acid length of a sequence. |
| Choosing the first paragraph of the paper as supporting text | Although a lot of information can be found in the introduction of a paper, the task was to choose the highlight that supported the GO term. |
| Difficulty in interpreting word order, e.g. 'RNA binding protein' yielded the incorrect GO prediction 'protein binding' | |
| Difficulty in predicting the correct taxonomic origin of a protein | This can also be difficult for a curator, given a lack of evidence in the text. |
| Too many low-confidence runs | Only submit data with a high confidence level for evaluation. Limit participants to their best run/technique (little difference between runs led to repeat evaluations). |
Inter-annotator agreement.
| | Curator 1 | Curator 2 | Curator 3 | Overall |
| Exact GO term | 47 | 35 | 35 | 39 |
| Same lineage | 15 | 20 | 19 | 18 |
| New lineage | 56 | 39 | 35 | 43 |
| Correct annotations | 107 | 91 | 85 | 94 |
| Incorrect annotations | 11 | 3 | 4 | 18 |
| Total annotations | 118 | 94 | 89 | 100 |
| Precision | 0.91 | 0.96 | 0.96 | 0.94 |
| Recall | 0.70 | 0.72 | 0.73 | 0.72 |
| F-measure | 0.79 | 0.82 | 0.83 | 0.82 |
Where precision is the fraction of manual GO term annotations that are correct: number of correct annotations / (number of correct annotations + number of incorrect annotations). Recall is defined as the fraction of correct GO term annotations that were successfully retrieved during manual annotation: number of correct annotations / (number of correct annotations + number of new lineage annotations − number of incorrect annotations). New lineage annotations minus incorrect annotations represent the total number of GO terms that the curators should have correctly retrieved from the paper. F-measure (balanced precision and recall) = 2 × P × R / (P + R).
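As a sketch, the agreement metrics defined above can be computed directly from the table's counts. The `annotation_metrics` helper is illustrative (not from the paper); the counts used are those reported for the first curator (107 correct, 11 incorrect, 56 new-lineage annotations).

```python
# Inter-annotator agreement metrics as defined in the text above.
def annotation_metrics(correct, incorrect, new_lineage):
    precision = correct / (correct + incorrect)
    # New-lineage annotations minus incorrect ones are the extra GO terms
    # the curator should also have correctly retrieved from the paper.
    recall = correct / (correct + new_lineage - incorrect)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Counts from the first curator's column of the agreement table.
p, r, f = annotation_metrics(correct=107, incorrect=11, new_lineage=56)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.91 0.7 0.79
```

The rounded values reproduce the first column of the precision, recall and F-measure rows.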
Comparison of BioCreAtIvE test set manual annotations with electronic GO annotation predictions.
| | InterPro2GO | SPKW2GO | EC2GO | Evaluation |
| Total IEA annotations | 635 | 385 | 27 | |
| Exact term | 151 (0.24) | 62 (0.16) | 18 (0.67) | Correct |
| Same lineage > granularity | 24 (0.04) | 10 (0.03) | 3 (0.11) | Potentially incorrect/correct |
| Same lineage < granularity | 273 (0.43) | 170 (0.44) | 1 (0.04) | Correct |
| Total same lineage | 297 (0.47) | 180 (0.47) | 4 (0.15) | Potentially incorrect/correct |
| New lineage | 187 (0.29) | 143 (0.37) | 5 (0.19) | Potentially incorrect/correct |
| Total potential incorrect | 211 (0.33) | 153 (0.40) | 8 (0.30) | |
| Total minimal correct | 424 (0.67) | 232 (0.60) | 19 (0.70) | |
| Accuracy range | 0.67–1.00 | 0.60–1.00 | 0.70–1.00 | |
Where the GO evidence code IEA is 'Inferred from Electronic Annotation' [27]. 'Same lineage > granularity' means the electronic mapping (InterPro2GO, EC2GO or SPKW2GO) predicted a GO term in the same lineage/branch as the manually curated GO term but more granular than it (a child term); 'Same lineage < granularity' denotes a less granular (parent) term. 'Total potential incorrect' annotations = 'Same lineage > granularity' + 'New lineage'. 'Total minimal correct' annotations = 'Exact term' + 'Same lineage < granularity'. Fractions of the column total are shown in parentheses.
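The summary rows above follow mechanically from the four prediction categories. A minimal sketch, using a hypothetical `summarize` helper and the counts reported for the first mapping column (151 exact, 273 less granular, 24 more granular, 187 new lineage):

```python
# Derive the summary rows of the comparison table from the four
# prediction categories, as defined in the note above.
def summarize(exact, same_lt, same_gt, new_lineage):
    total = exact + same_lt + same_gt + new_lineage
    minimal_correct = exact + same_lt            # 'Total minimal correct'
    potential_incorrect = same_gt + new_lineage  # 'Total potential incorrect'
    # Accuracy lies between "only the minimal-correct set is right" and
    # "every potentially incorrect prediction also turns out to be right".
    lower = minimal_correct / total
    upper = (minimal_correct + potential_incorrect) / total
    return minimal_correct, potential_incorrect, (lower, upper)

mc, pi, (lo, hi) = summarize(exact=151, same_lt=273, same_gt=24, new_lineage=187)
print(mc, pi, round(lo, 2), hi)  # 424 211 0.67 1.0
```

This reproduces the 'Total minimal correct', 'Total potential incorrect' and accuracy-range entries of the first column; the upper bound is always 1.0 because the categories partition the total.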
Manual verification of electronic GO annotation reliability on 44 proteins.
| | InterPro2GO | SPKW2GO | EC2GO | Total |
| Proteins evaluated | 44 | 44 | 44 | 44 |
| Proteins with electronic annotation | 29 (0.65) | 25 (0.56) | 11 (0.25) | - |
| Proteins without electronic annotation | 15 (0.34) | 19 (0.43) | 33 (0.75) | - |
| Total electronic annotations | 107 (0.63) | 53 (0.30) | 11 (0.06) | 171 |
| Correct (exact term + same lineage) | 97 | 48 | 11 | 156 |
| Incorrect | 0 | 1 | 0 | 1 |
| New lineage verified correct | 10 | 4 | 0 | 14 |
| Exact term | 40 | 20 | 10 | 70 |
| Same lineage | 57 | 28 | 1 | 86 |
| New lineage | 10 | 5 | 0 | 15 |
| Reliability range | 0.91–1.00 | 0.91–0.98 | 1.00 | 0.91–0.99 |
Fractions are shown in parentheses.