David Campos, Quoc-Chinh Bui, Sérgio Matos, José Luís Oliveira.
Abstract
BACKGROUND: Cellular events play a central role in the understanding of biological processes and functions, providing insight into both physiological and pathogenesis mechanisms. Automatic extraction of mentions of such events from the literature represents an important contribution to the progress of the biomedical domain, allowing faster updating of existing knowledge. The identification of trigger words indicating an event is a very important step in the event extraction pipeline, since the following tasks rely on its output. This step presents various complex and unsolved challenges, namely the selection of informative features, the representation of the textual context, and the selection of a specific event type for a trigger word given this context.
Year: 2014 PMID: 24401704 PMCID: PMC3896761 DOI: 10.1186/1751-0473-9-1
Source DB: PubMed Journal: Source Code Biol Med ISSN: 1751-0473
Figure 1. Textual representation of a complex biomedical event.
Figure 2. Illustration of the processing pipeline for the sentence "Down-regulation of interferon regulatory factor 4 gene expression in leukemic cells.", highlighting the output of linguistic parsing, shortest paths, provided concepts and extracted triggers.
Figure 3. Internal data structure to support a corpus with multiple sentences and associated information, namely tokens, chunks, dependency parsing graph, concept tree and features.
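The corpus structure described in Figure 3 can be sketched as nested records. The following is a minimal Python sketch; the field and class names are illustrative assumptions, not the tool's actual implementation (which is in Java):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the per-sentence structure from Figure 3:
# each sentence carries tokens, chunks, a dependency graph,
# concept annotations and extracted features.

@dataclass
class Token:
    index: int
    text: str
    lemma: str
    pos: str                                      # part-of-speech tag
    features: dict = field(default_factory=dict)  # extracted feature values

@dataclass
class Chunk:
    tag: str              # e.g. "NP", "VP"
    token_indices: list   # indices of tokens covered by this chunk

@dataclass
class Sentence:
    tokens: list          # list[Token]
    chunks: list          # list[Chunk]
    dependencies: list    # (head_index, dependent_index, relation) edges
    concepts: list        # annotated concept spans, e.g. proteins

@dataclass
class Corpus:
    sentences: list       # list[Sentence]

# Build a tiny one-sentence corpus as a usage example.
tok = Token(0, "Down-regulation", "down-regulation", "NN")
sent = Sentence(tokens=[tok], chunks=[Chunk("NP", [0])],
                dependencies=[], concepts=[])
corpus = Corpus(sentences=[sent])
print(len(corpus.sentences))  # prints 1
```

Keeping the dependency graph as an edge list alongside the token list makes it straightforward to compute shortest paths between a candidate trigger and a concept, as the pipeline in Figure 2 requires.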
Pseudo-code of the optimization algorithm

1) randomly split dataset
2) for each trigger type
   a) for each feature type
      i) activate feature
      ii) call TrainModels with the current configuration
      iii) if no improvement, deactivate feature
   b) for each context type
      i) activate context
      ii) call TrainModels with the current configuration
   c) store best performing context on model configuration
   d) for each feature with n-grams
      i) for each n-gram
         (1) activate n-gram
         (2) call TrainModels with the current configuration
      ii) store best performing n-gram of feature
   e) for each feature with dependency hops
      i) for each dependency hop
         (1) activate hop
         (2) call TrainModels with the current configuration
      ii) store best performing dependency hop of feature
   f) for each feature with vertex feature types
      i) for each vertex feature type
         (1) activate vertex type
         (2) call TrainModels with the current configuration
      ii) store best performing vertex type of feature
3) return MC (model configuration)
Pseudo-code of the TrainModels procedure

1) for each model order
   a) train model
   b) get performance of model
   c) store performance and model order if better
2) return best performance and order
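The greedy feature selection above (step 2a, together with TrainModels) can be sketched as follows. This is a minimal, self-contained Python sketch: `evaluate` is a deterministic stand-in for training and scoring a model on the held-out split, and the feature names and weights are invented for illustration, not part of the actual system:

```python
# Stand-in evaluation: scores a configuration on a held-out split.
# In the real system this would train a model and measure F1.
def evaluate(config, order):
    # Toy weights: "lemma" and "pos" help, "chunk" hurts slightly.
    weights = {"token": 0.10, "lemma": 0.25, "pos": 0.15, "chunk": -0.05}
    return 0.5 + sum(weights.get(f, 0.0) for f in config) + 0.01 * order

def train_models(config, orders=(1, 2)):
    """TrainModels: try each model order, keep the best score and order."""
    best_score, best_order = -1.0, None
    for order in orders:
        score = evaluate(config, order)
        if score > best_score:
            best_score, best_order = score, order
    return best_score, best_order

def optimize(feature_types):
    """Greedy forward selection over feature types (step 2a)."""
    config = set()
    best_score, _ = train_models(config)
    for feat in feature_types:
        config.add(feat)                   # i)   activate feature
        score, _ = train_models(config)    # ii)  train with it active
        if score <= best_score:            # iii) no improvement:
            config.remove(feat)            #      deactivate feature
        else:
            best_score = score
    return config, best_score

config, score = optimize(["token", "lemma", "pos", "chunk"])
print(sorted(config))  # prints ['lemma', 'pos', 'token']
```

With these toy weights, "chunk" lowers the score and is deactivated, while the other three features survive. The same activate-evaluate-keep loop generalizes to steps 2b through 2f, where the swept dimension is the context type, n-gram size, dependency hop count, or vertex feature type instead of feature presence.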
Statistics of the training and development data sets of the BioNLP 2009 GENIA shared task: number of abstracts, sentences, annotated proteins, events and triggers

|           | Training | Development |
| --------- | -------- | ----------- |
| Abstracts | 800      | 150         |
| Sentences | ≈7449    | ≈1450       |
| Proteins  | 9300     | 2080        |
| Events    | 8615     | 1795        |
| Triggers  | 7041     | 1476        |
Figure 4. Illustration of the workflow applied to perform optimization, train the final models and annotate the development corpus.
Figure 5. Detailed performance results achieved by the proposed automatic approach compared with existing state-of-the-art systems.