Martin Krallinger, Florian Leitner, Miguel Vazquez, David Salgado, Christophe Marcelle, Mike Tyers, Alfonso Valencia, Andrew Chatr-aryamontri.
Abstract
There is an increasing interest in developing ontologies and controlled vocabularies to improve the efficiency and consistency of manual literature curation, to enable more formal biocuration workflow results and ultimately to improve analysis of biological data. Two ontologies that have been successfully used for this purpose are the Gene Ontology (GO) for annotating aspects of gene products and the Molecular Interaction ontology (PSI-MI) used by databases that archive protein-protein interactions. The examination of protein interactions has proven to be extremely promising for the understanding of cellular processes. Manual mapping of information from the biomedical literature to bio-ontology terms is one of the most challenging components in the curation pipeline. It requires that expert curators interpret the natural language descriptions contained in articles and infer their semantic equivalents in the ontology (controlled vocabulary). Since manual curation is a time-consuming process, there is strong motivation to implement text-mining techniques to automatically extract annotations from free text. A range of text mining strategies has been devised to assist in the automated extraction of biological data. These strategies either recognize technical terms used recurrently in the literature and propose them as candidates for inclusion in ontologies, or retrieve passages that serve as evidential support for annotating an ontology term, e.g. from the PSI-MI or GO controlled vocabularies. Here, we provide a general overview of current text-mining methods to automatically extract annotations of GO and PSI-MI ontology terms in the context of the BioCreative (Critical Assessment of Information Extraction Systems in Biology) challenge. Special emphasis is given to protein-protein interaction data and PSI-MI terms referring to interaction detection methods.
Year: 2012 PMID: 22438567 PMCID: PMC3309177 DOI: 10.1093/database/bas017
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1.This figure shows schematically how protein interaction data are annotated and/or marked up using ontologies. Systems such as MyMiner (myminer.armi.monash.edu.au/links.php) have been used for text labeling and highlighting purposes in the context of the BioCreative competition. The main steps illustrated in this figure have been addressed in the BioCreative challenges. Finding associations between textual expressions referring to experimental techniques used to characterize protein interactions and their equivalent concepts in the MI ontology is cumbersome in some cases, when deep domain inference is required. Experienced curators are able to quickly navigate the term hierarchy to find the appropriate terms, while novice annotators often need to search the ontology using method keywords as queries and consult the associated descriptive information for potential candidate terms.
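The keyword-based ontology search that novice annotators rely on can be sketched as a simple lookup over term names and synonyms. The tiny term dictionary below is an illustrative, hand-picked slice of the PSI-MI interaction detection method branch, not a complete or authoritative ontology:

```python
# Hypothetical mini-slice of the PSI-MI interaction detection method branch.
# IDs are real MI identifiers, but the synonym lists are illustrative only.
MI_TERMS = {
    "MI:0018": {"name": "two hybrid", "synonyms": ["2-hybrid", "yeast two hybrid", "Y2H"]},
    "MI:0006": {"name": "anti bait coimmunoprecipitation", "synonyms": ["anti bait coip"]},
    "MI:0096": {"name": "pull down", "synonyms": ["pull-down", "pulldown"]},
}

def search_terms(query, terms=MI_TERMS):
    """Return MI term IDs whose name or any synonym contains the query keyword."""
    q = query.lower()
    hits = []
    for term_id, entry in terms.items():
        labels = [entry["name"]] + entry["synonyms"]
        if any(q in label.lower() for label in labels):
            hits.append(term_id)
    return hits

print(search_terms("two hybrid"))  # candidate MI terms for a method keyword
```

A real annotation interface would search the full ontology (with descriptions and hierarchy), but the principle — substring matching over names and synonyms to propose candidate terms — is the same.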
Summary of the BioCreative editions related to the identification of ontology terms in articles
| Information | BioCreative I, task 1 | BioCreative I, task 2 | BioCreative II—IMS | BioCreative III—IMS |
|---|---|---|---|---|
| Description | Return evidence text fragments for protein–GO–document triplets | Predict GO annotations derivable from a given protein–article pair | Prediction of MI annotations from PPI-relevant articles | Prediction of MI annotations from PPI-relevant articles (ranked with evidence passages) |
| Ontologies | GO | GO | MI ontology | MI ontology |
| Curators/databases | GOA-EBI | GOA-EBI | MINT and IntAct | BioGRID and MINT |
| Participants | 9 | 6 | 2 | 8 |
| Data/format | Full-text articles, SGML format | Full-text articles, SGML format | Full-text articles, PDF and HTML format | Full-text articles, PDF format |
| Training | 803 articles | 803 articles | 740 articles | 2003 training articles and 587 development set articles |
| Test | 113 articles | 99 articles | 358 articles | 223 articles |
| Evaluation | Three labels (correct, general, wrong), % correct cases | Three labels (correct, general, wrong), % correct cases | Precision, recall and F-score; mapping to the parent terms | Precision, recall, F-score, ranked predictions (AUC iP/R) |
| Methods | Term lookup, pattern matching/template extraction, term tokens (information content of GO words) | Term lookup, pattern matching/template extraction, term tokens (information content of GO words) | Pattern matching, automatically generating variants of MI terms, handcrafted patterns | Cross-ontology mapping, manual and automatic extension of method names, statistics of word tokens building terms (mutual information, chi-square), machine learning on training-set articles |
| Result highlights | Precisions from 46% to 80%, accuracy of ∼30% | Precisions from 9% to 35% | Precision from 32% to 67% | Most between 30% and 80% |
| Observation | Limited recall, effect of GO term length | Limited recall, difference in performance depending on GO categories, cellular component terms are easier | Difficulties with very general method terms | Difficulties in case of methods not specific to PPIs, problems with recall |
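The precision, recall and F-score measures used to evaluate the later BioCreative editions can be computed over sets of predicted versus gold-standard annotations. The (document, MI term) pairs below are made-up examples, not actual challenge data:

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F-score for sets of annotations,
    here represented as (article ID, MI term ID) pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives: predictions matching the gold standard
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Illustrative gold-standard and predicted annotation sets
gold = {("doc1", "MI:0018"), ("doc1", "MI:0096"), ("doc2", "MI:0006")}
pred = {("doc1", "MI:0018"), ("doc2", "MI:0006"), ("doc2", "MI:0018")}
print(precision_recall_f1(pred, gold))
```

The BioCreative III evaluation additionally ranked predictions and scored them with AUC iP/R, which rewards systems that place correct annotations near the top of their ranked list.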
Figure 2.Historical view and timeline of the BioCreative challenges in the context of other community efforts, textual resources (corpora) and applications developed in the area of biomedical text mining. The upper bar shows the number of new records added to PubMed each year, expressed in thousands (K). The lower bar refers to the corresponding year timeline. Pink squares, appearance of biomedical text mining methods; green octagons, relevant ontologies, lexical resources and corpora; yellow boxes, community challenges; blue ovals, biomedical text mining applications.
Figure 3.Schematic overview of the extraction of GO annotations from the literature. The process illustrates the individual steps of the annotation process, covering the initial selection of relevant documents for GO annotation of proteins, identification of proteins and their corresponding database identifiers followed by the extraction of associations to GO terms and the retrieval of evidence sentences/passages. The participating teams had to provide the evidence passages for a given document–protein–GO term triplet for one subtask, and to actually detect GO–protein associations (together with evidence passages) for the other subtask.
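The final step of the pipeline above — retrieving evidence sentences for a document–protein–GO term triplet — can be approximated by a naive co-occurrence baseline: return sentences that mention the protein together with a word from the GO term. The protein name and sentences below are invented for illustration:

```python
import re

def evidence_sentences(text, protein, go_term_words):
    """Naive baseline: return sentences mentioning the protein together with
    at least one word of the GO term (case-insensitive co-occurrence)."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    go_words = {w.lower() for w in go_term_words}
    hits = []
    for sentence in sentences:
        lower = sentence.lower()
        if protein.lower() in lower and any(w in lower for w in go_words):
            hits.append(sentence)
    return hits

# Hypothetical article text; GO term words for "nucleus" (GO:0005634)
doc = ("Myf5 localizes to the nucleus of muscle precursors. "
       "Deletion of the domain abolished nuclear localization of Myf5.")
print(evidence_sentences(doc, "Myf5", ["nucleus", "nuclear"]))
```

Real systems go well beyond this, using term normalization, lexical variants and statistical or machine-learning scoring, but simple co-occurrence retrieval of this kind was a common component of the participating pipelines.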
Figure 4.Example predictions of the GO task of BioCreative I. (A) A correct prediction, containing the information on the corresponding document, protein and GO term, as well as the supporting evidence text passages extracted automatically from the full-text article. (B) An example of a wrong prediction, shown as a screenshot of the original evaluation interface developed at the time for this task (based on Apache/PHP). The original evaluation application, implemented specifically for this task, is no longer functional. Proteins and GO terms were defined unambiguously through corresponding standard identifiers. The database curators manually evaluated the correctness of both the protein and the GO terms.
Figure 5.Representative predictions submitted for the MI task of BioCreative III, of diverse degrees of difficulty for automated systems. The examples correspond to submissions from various teams. Participating teams had to return the article identifier, the concept identifier for the interaction detection method according to the MI ontology, a rank, a confidence score, as well as a supporting evidence text passage extracted from the full-text article. Submissions were plain-text files in which each field was separated by a tab character. This figure provides colored highlights of the original predictions to make the output easier to grasp. In red, the original term from the MI ontology and its synonyms have been added to facilitate the interpretation of the results. As can be seen, some cases are rather straightforward and could be detected by direct term lookup, while others require generating lexical variants or even more sophisticated machine learning and statistical word analysis.
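The tab-separated submission format described above can be parsed with a few lines of code. The field names and the example line are illustrative, reconstructed from the field list in the caption rather than copied from an actual submission:

```python
def parse_submission_line(line):
    """Parse one tab-separated MI-task prediction: article identifier,
    MI term identifier, rank, confidence score and evidence passage."""
    article, mi_id, rank, score, evidence = line.rstrip("\n").split("\t")
    return {
        "article": article,
        "mi_term": mi_id,
        "rank": int(rank),
        "score": float(score),
        "evidence": evidence,
    }

# Hypothetical prediction line in the five-field, tab-separated format
row = parse_submission_line(
    "12345678\tMI:0018\t1\t0.92\tinteraction was confirmed by yeast two-hybrid assay"
)
print(row["mi_term"], row["rank"])
```

Keeping the format this simple made it easy for the organizers to pool ranked predictions from many teams and score them uniformly.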