Jorge A Vanegas1, Sérgio Matos2, Fabio González1, José L Oliveira2.
Abstract
This paper presents a review of state-of-the-art approaches to automatic extraction of biomolecular events from scientific texts. Events involving biomolecules such as genes, transcription factors, or enzymes play a central role in biological processes and functions and provide valuable information for describing physiological and pathogenic mechanisms. Event extraction from biomedical literature has a broad range of applications, including support for information retrieval, knowledge summarization, and information extraction and discovery. However, automatic event extraction is a challenging task because of the ambiguity and diversity of natural language and because of higher-level linguistic phenomena, such as speculations and negations, which occur in biological texts and can lead to misunderstanding or incorrect interpretation. Many strategies have been proposed in the last decade, originating from different research areas such as natural language processing, machine learning, and statistics. This review summarizes the most representative approaches in biomolecular event extraction and presents an analysis of the current state of the art and of commonly used methods, features, and tools. Finally, current research trends and future perspectives are also discussed.
Year: 2015 PMID: 26587051 PMCID: PMC4637451 DOI: 10.1155/2015/571381
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Figure 1: Example of a complex biomolecular event extracted from a text fragment. A recursive structure, composed of two types of events, is presented: Positive Regulation and Expression.
Figure 2: Overall pipeline of a biomedical event extraction solution. Joint prediction methods merge steps 3 and 4 into a single step. The corresponding reference paper for each tool and method is also identified [13–50].
Most common features used in the main event detection stages.
| Feature group | Feature | Trigger recognition | Edge detection |
|---|---|---|---|
| Token | Part-of-speech | | |
| Token | Lemma | | |
| Token | Orthographic | | |
| Token | Char n-grams | | |
| Token | Word shape | | |
| Token | Prefixes/suffixes | | |
| Sentence and local context | Number of entities | | |
| Sentence and local context | BoW counts | | |
| Sentence and local context | Windows or conjunctions of features | | |
| Dependency | Number and type of dependency edges | | |
| Dependency | Words, lemmas, or POS tags in dependency path | | |
| Dependency | N-grams in dependency path | | |
| External resources | WordNet lemmas | | |
| External resources | Trigger lexicon | | |
| External resources | Entity lexicon | | |
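
As a rough illustration of the token-level feature group listed above (word shape, character n-grams, prefixes/suffixes, orthographic cues), the following minimal sketch computes such features for a single token. The function names, the `X`/`x`/`d` shape alphabet, and the affix length are illustrative assumptions and do not reproduce the feature set of any specific system covered by the review.

```python
def word_shape(token: str) -> str:
    """Map uppercase -> 'X', lowercase -> 'x', digit -> 'd'; keep other characters."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def char_ngrams(token: str, n: int = 3) -> list:
    """Return all contiguous character n-grams of the token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def token_features(token: str, affix_len: int = 3) -> dict:
    """Collect token-level features (hypothetical feature names)."""
    return {
        "lower": token.lower(),
        "shape": word_shape(token),
        "prefix": token[:affix_len],
        "suffix": token[-affix_len:],
        "has_digit": any(ch.isdigit() for ch in token),  # orthographic cue
        "has_hyphen": "-" in token,                      # orthographic cue
        "char_3grams": char_ngrams(token, 3),
    }


print(token_features("RFLAT-1"))
```

In a real pipeline these dictionaries would typically be combined with context and dependency features before being passed to a classifier.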
Figure 3: Trigger detection for two example sentences: (a) “RFLAT-1 activates RANTES gene expression” and (b) “Inhibition of LITAF mRNA expression in THP-1 cells resulted in a reduction of TNF-alpha transcripts.”
Most relevant work addressing the problem of trigger detection. Studies are listed in chronological order, and the approaches are classified into three main groups: rule-based, dictionary-based, and ML-based strategies.
| Approach | Reference | |||||
|---|---|---|---|---|---|---|
| Rule-based | Dictionary-based | ML-based | ||||
| SVM | CRF | VSM | MEMM | |||
| X | X | Kilicoglu and Bergler 2009 [ | ||||
| X | X | MacKinlay et al. 2009 [ | ||||
| X (structural) | X | Björne et al. 2009 [ | ||||
| X | Miwa et al. 2010 [ | |||||
| X | X | Le Minh et al. 2011 [ | ||||
| X | X | X | Kilicoglu and Bergler 2011 [ | |||
| X | Casillas et al. 2011 [ | |||||
| X | X (L, R) | Van Landeghem et al. 2011 [ | ||||
| X (P) | X | X (CS) | Martinez and Baldwin 2011 [ | |||
| X | Zhou and He 2011 [ | |||||
| X (L) | Miwa et al. 2012 [ | |||||
| X (L) | Björne et al. 2012 [ | |||||
| X (C) | Qian and Zhou 2012 [ | |||||
| X (L) | Wang et al. 2013 [ | |||||
| X (L) | Hakala et al. 2013 [ | |||||
| X (L) | Zhang et al. 2013 [ | |||||
| X (L) | Liu et al. 2013 [ | |||||
| X | Campos et al. 2014 [ | |||||
| X (L) | Xia et al. 2014 [ | |||||
L: linear kernel; R: radial basis function kernel; P: polynomial kernel; C: convolution tree kernel; CS: cosine similarity.
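
To make the dictionary-based strategy concrete, here is a minimal sketch that looks up each token of a sentence in a trigger lexicon and reports the matches with their character offsets. The lexicon entries, the event type labels attached to them, and the simple regular-expression tokenizer are assumptions made for illustration; real systems use curated lexicons and usually add disambiguation rules or a classifier on top.

```python
import re

# Hypothetical trigger lexicon: lowercase surface form -> event type.
TRIGGER_LEXICON = {
    "activates": "Positive Regulation",
    "expression": "Expression",
    "inhibition": "Negative Regulation",
    "reduction": "Negative Regulation",
}


def detect_triggers(sentence: str, lexicon: dict = TRIGGER_LEXICON) -> list:
    """Return (token, start offset, event type) for every token found in the lexicon."""
    triggers = []
    # Naive tokenizer: runs of letters, digits, and hyphens.
    for match in re.finditer(r"[A-Za-z0-9-]+", sentence):
        event_type = lexicon.get(match.group().lower())
        if event_type is not None:
            triggers.append((match.group(), match.start(), event_type))
    return triggers


for sentence in (
    "RFLAT-1 activates RANTES gene expression",
    "Inhibition of LITAF mRNA expression in THP-1 cells resulted in "
    "a reduction of TNF-alpha transcripts.",
):
    print(detect_triggers(sentence))
```

A pure lookup like this cannot resolve ambiguous words, which is one reason the dictionary-based column in the table is so often combined with rule-based or ML-based components.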
Figure 4: Event extraction from two example sentences: (a) “phosphorylation of TRAF2” and (b) “TNF-alpha which is a rapid activator of IL-8 gene expression.”
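
The two fragments in Figure 4 can be represented with a small recursive data structure in which an event argument is either an entity or another event. The sketch below is one possible encoding: the class names, the `Theme`/`Cause` role labels, and the choice of trigger words are assumptions made for illustration and may differ from the annotation shown in the figure.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass(frozen=True)
class Entity:
    """A named biomolecule mentioned in the text."""
    text: str


@dataclass
class Event:
    """An event with a trigger word and role-labelled arguments."""
    event_type: str
    trigger: str
    arguments: Dict[str, Union[Entity, "Event"]] = field(default_factory=dict)


def depth(event: Event) -> int:
    """Nesting depth: 1 for a flat event, plus 1 for each level of nested events."""
    nested = [depth(arg) for arg in event.arguments.values() if isinstance(arg, Event)]
    return 1 + max(nested, default=0)


# (a) "phosphorylation of TRAF2": a simple event with one entity argument.
phosphorylation = Event("Phosphorylation", "phosphorylation", {"Theme": Entity("TRAF2")})

# (b) "TNF-alpha which is a rapid activator of IL-8 gene expression":
# a regulation event whose theme is itself an event (assumed role assignment).
expression = Event("Expression", "expression", {"Theme": Entity("IL-8")})
regulation = Event(
    "Positive Regulation",
    "activator",
    {"Cause": Entity("TNF-alpha"), "Theme": expression},
)

print(depth(phosphorylation), depth(regulation))
```

Allowing events as arguments is what makes regulation events harder to extract than simple events: an error in the nested event propagates to every event that refers to it.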
Most relevant work addressing the problem of edge detection. Studies are listed in chronological order, and the approaches are classified into three main groups: rule-based, dictionary-based, and ML-based strategies.
| Approach | Reference | ||||
|---|---|---|---|---|---|
| Rule-based | Dictionary-based | ML-based | |||
| SVM | CRF | HVS | |||
| | | Kilicoglu and Bergler 2009 [ | |||
| | | Björne et al. 2009 [ | |||
| | MacKinlay et al. 2009 [ | ||||
| | Miwa et al. 2010 [ | ||||
| | Le Minh et al. 2011 [ | ||||
| | Kilicoglu and Bergler 2011 [ | ||||
| | | Zhou and He 2011 [ | |||
| | Martinez and Baldwin 2011 [ | ||||
| | Miwa et al. 2012 [ | ||||
| | Björne et al. 2012 [ | ||||
| | Wang et al. 2013 [ | ||||
| | Hakala et al. 2013 [ | ||||
| | Xia et al. 2014 [ | ||||
L: linear kernel.
Modality detection. Most relevant work addressing the problem of modality detection, classified into rule-based, dictionary-based, and ML-based strategies.
| Approach | Reference | |||
|---|---|---|---|---|
| Rule-based | Dictionary-based | ML-based | ||
| SVM | CRF | |||
| | | Kilicoglu and Bergler 2009 [ | ||
| | Björne et al. 2009 [ | |||
| | MacKinlay et al. 2009 [ | |||
| | Miwa et al. 2010 [ | |||
| | Kilicoglu and Bergler 2011 [ | |||
| | Miwa et al. 2012 [ | |||
| | Björne et al. 2012 [ | |||
| | Van Landeghem et al. 2013 [ | |||
| | Xia et al. 2014 [ | |||
L: linear kernel.
Core event extraction performance comparison. BioNLP shared task comparison results in recall/precision/F-score (%) on the test set for Task 1 (core event extraction). (A) abstracts only and (F) full papers. Data extracted from BioNLP-ST 2009, BioNLP-ST 2011, and BioNLP-ST 2013 overviews [51, 52, 109].
| Year | System | Corpus | Simple events | Binding events | Regulation events | Total |
|---|---|---|---|---|---|---|
| 2009 | UTurku | (A) | 64.21/77.45/70.21 | 40.06/49.82/44.41 | 35.63/45.87/40.11 | 46.73/58.48/51.95 |
| 2010 | Miwa | (A) | 65.31/76.44/70.44 | 52.16/53.08/52.62 | 35.93/46.66/40.60 | 48.62/58.96/53.29 |
| 2011 | FAUST | (A) | 66.16/81.04/72.85 | 45.53/58.09/51.05 | 39.38/58.18/46.97 | 50.00/67.53/57.46 |
| 2011 | UMass | (A) | 64.21/80.74/71.54 | 43.52/60.89/50.76 | 38.78/55.07/45.51 | 48.74/65.94/56.05 |
| 2013 | EVEX | (F) | 73.83/79.56/76.59 | 41.14/44.77/42.88 | 32.41/47.16/38.41 | 45.44/58.03/50.97 |
| 2013 | TEES-2.1 | (F) | 74.19/79.64/76.82 | 42.34/44.34/43.32 | 33.08/44.78/38.05 | 46.17/56.32/50.74 |
| 2013 | BioSEM | (F) | 67.71/86.90/76.11 | 47.45/52.32/49.76 | 28.19/49.06/35.80 | 42.47/62.83/50.68 |
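
Each cell in these comparison tables lists recall, precision, and F-score. Assuming the F-score is the balanced F1 measure, that is, the harmonic mean of recall and precision, the third value can be recomputed from the first two. The short sketch below checks this against the UTurku 2009 row; the function name is illustrative.

```python
def f_score(recall: float, precision: float) -> float:
    """Balanced F-score (harmonic mean of recall and precision), in the same units as the inputs."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# UTurku 2009, total: recall 46.73, precision 58.48 (values in %).
print(round(f_score(46.73, 58.48), 2))  # 51.95

# UTurku 2009, simple events: recall 64.21, precision 77.45.
print(round(f_score(64.21, 77.45), 2))  # 70.21
```

Because the harmonic mean is dominated by the smaller of its two inputs, systems with high precision but low recall, as in the regulation column, still obtain modest F-scores.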
Event enrichment performance comparison. BioNLP shared task comparison results in recall/precision/F-score (%) on the test set for Task 2 (event enrichment). (A) abstracts only and (F) full papers. Data extracted from BioNLP-ST 2009, BioNLP-ST 2011, and BioNLP-ST 2013 overviews [51, 52, 109].
| Year | System | Corpus | Site | Localization | Total |
|---|---|---|---|---|---|
| 2009ᵃ | UTurku + DBCLS09 | (A) | 71.43/71.43/71.43 | 23.08/88.24/36.59 | 32.14/72.41/44.52 |
| 2011ᵇ | FAUST | (A) | 43.51/71.25/54.03 | 36.92/77.42/50.00 | 41.33/72.97/52.77 |
| 2011ᵇ | UMass | (A) | 42.75/70.00/53.08 | 36.92/77.42/50.00 | 40.82/72.07/52.12 |
| 2013ᶜ | TEES-2.1 | (F) | 20.68/59.82/30.73 | 36.67/78.57/50.00 | 22.03/61.90/32.50 |
| 2013ᶜ | EVEX | (F) | 19.44/59.43/29.30 | 36.67/78.57/50.00 | 20.90/61.67/31.22 |

ᵃ Only phosphorylation sites were considered.
ᵇ The results are for overall binding and phosphorylation sites.
ᶜ The task included the prediction of sites for other protein modification and regulation events.
Negation and speculation detection performance comparison. BioNLP shared task comparison results in recall/precision/F-score (%) on the test set for Task 3 (negation/speculation detection). (A) abstracts only and (F) full papers only. Data extracted from BioNLP-ST 2009, BioNLP-ST 2011, and BioNLP-ST 2013 overviews [51, 52, 109].
| Year | System | Corpus | Negation | Speculation | Total |
|---|---|---|---|---|---|
| 2009 | ConcordU09 | (A) | 14.98/50.75/23.13 | 16.83/50.72/25.27 | 15.86/50.74/24.17 |
| 2011 | UTurku | (A) | 22.03/49.02/30.40 | 19.23/38.46/25.64 | 20.69/43.69/28.08 |
| 2011 | ConcordU11 | (A) | 18.06/46.59/26.03 | 23.08/40.00/29.27 | 20.46/42.79/27.68 |
| 2013 | TEES-2.1 | (F) | 21.68/36.84/27.30 | 18.46/33.96/23.92 | 19.53/35.59/25.22 |
| 2013 | EVEX | (F) | 20.98/38.03/27.04 | 18.46/32.73/23.61 | 19.82/34.41/25.15 |