Xiao Liu, Antoine Bordes, Yves Grandvalet.
Abstract
BACKGROUND: Huge amounts of electronic biomedical documents, such as molecular biology reports or genomic papers, are generated daily. These documents are mainly available as unstructured free text, which requires heavy processing before it can be registered in organized databases. This organization is instrumental for information retrieval, making it possible to answer the advanced queries of researchers and practitioners in biology, medicine, and related fields. Hence, the massive data flow calls for efficient automatic text-mining methods that extract high-level information, such as biomedical events, from biomedical text. The usual computational tools of Natural Language Processing cannot be readily applied to extract these biomedical events because of the peculiarities of the domain: biomedical documents contain highly domain-specific jargon and syntax. These documents also describe distinctive dependencies, making text-mining in molecular biology a specific discipline.
Year: 2015 PMID: 26201478 PMCID: PMC4511465 DOI: 10.1186/1471-2105-16-S10-S8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Sentence and associated events. Excerpt from the BioNLP 2013 Genia task.
Classes and types of events with their arguments (P stands for Protein, E for Event).
| Class | Type | Arguments | |
|---|---|---|---|
| SVT | Gene expression | theme (P) | |
| SVT | Transcription | theme (P) | |
| SVT | Protein catabolism | theme (P) | |
| SVT | Phosphorylation | theme (P) | |
| SVT | Localization | theme (P) | |
| BIN | Binding | theme (P) | theme_2 (P) |
| REG | Regulation | theme (P/E) | cause (P/E) |
| REG | Positive regulation | theme (P/E) | cause (P/E) |
| REG | Negative regulation | theme (P/E) | cause (P/E) |
Statistics of corpora of BioNLP GE tasks.
| Year | Training | | Development | | Test | |
|---|---|---|---|---|---|---|
| 2009 | 800 | 0 | 150 | 0 | 260 | 0 |
| 2011 | 800 | 108 | 150 | 109 | 260 | 87 |
| 2013 | 0 | 222 | 0 | 249 | 0 | 305 |
Figure 2. Examples of ambiguous annotations.
Confusion matrix for RUPEE on the BioNLP 2013 GE task, computed by cross-validation on the training and development sets.
| | None | Gene exp | Trans | Pro cat | Phosp | Local | Bind | Regu | Pos reg | Neg reg | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| None | 223460 | 404 | 163 | 27 | 42 | 60 | 296 | 390 | 799 | 397 | 226038 |
| Gene expression | 440 | 2741 | 13 | 0 | 0 | 17 | 2 | 1 | 43 | 5 | 3262 |
| Transcription | 186 | 16 | 565 | 0 | 0 | 0 | 0 | 4 | 15 | 0 | 786 |
| Protein catabolism | 30 | 0 | 0 | 150 | 0 | 0 | 0 | 0 | 1 | 0 | 181 |
| Phosphorylation | 76 | 0 | 0 | 0 | 413 | 0 | 0 | 0 | 0 | 0 | 489 |
| Localization | 114 | 20 | 0 | 0 | 0 | 398 | 4 | 0 | 1 | 2 | 539 |
| Binding | 507 | 0 | 0 | 0 | 0 | 1 | 1470 | 2 | 0 | 1 | 1981 |
| Regulation | 453 | 0 | 0 | 0 | 0 | 0 | 1 | 813 | 33 | 4 | 1304 |
| Positive regulation | 1245 | 42 | 10 | 0 | 0 | 0 | 2 | 67 | 2456 | 7 | 3829 |
| Negative regulation | 555 | 7 | 2 | 1 | 1 | 0 | 0 | 46 | 11 | 1176 | 1799 |
| Total | 227066 | 3230 | 753 | 178 | 456 | 476 | 1775 | 1323 | 3359 | 1592 | |
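The marginal totals of the confusion matrix can be re-derived from its cells, and the diagonal can be turned into per-class rates. The following is a minimal sketch with illustrative variable names. The caption does not say which axis holds the gold labels, so the code reports the diagonal count divided by the row total and by the column total without calling either one precision or recall.

```python
# Confusion matrix for RUPEE (BioNLP 2013 GE task), copied from the table above.
labels = [
    "None", "Gene expression", "Transcription", "Protein catabolism",
    "Phosphorylation", "Localization", "Binding", "Regulation",
    "Positive regulation", "Negative regulation",
]
matrix = [
    [223460, 404, 163, 27, 42, 60, 296, 390, 799, 397],
    [440, 2741, 13, 0, 0, 17, 2, 1, 43, 5],
    [186, 16, 565, 0, 0, 0, 0, 4, 15, 0],
    [30, 0, 0, 150, 0, 0, 0, 0, 1, 0],
    [76, 0, 0, 0, 413, 0, 0, 0, 0, 0],
    [114, 20, 0, 0, 0, 398, 4, 0, 1, 2],
    [507, 0, 0, 0, 0, 1, 1470, 2, 0, 1],
    [453, 0, 0, 0, 0, 0, 1, 813, 33, 4],
    [1245, 42, 10, 0, 0, 0, 2, 67, 2456, 7],
    [555, 7, 2, 1, 1, 0, 0, 46, 11, 1176],
]

# Marginal totals: one per row and one per column.
row_totals = [sum(row) for row in matrix]
col_totals = [sum(col) for col in zip(*matrix)]

# Diagonal rates. One of the two is recall and the other precision,
# depending on which axis holds the gold labels (not stated in the caption).
row_rates = [matrix[i][i] / row_totals[i] for i in range(len(labels))]
col_rates = [matrix[i][i] / col_totals[i] for i in range(len(labels))]

for label, r, c in zip(labels, row_rates, col_rates):
    print(f"{label:20s} row rate {r:.3f}   column rate {c:.3f}")
```

The recomputed sums match the totals printed in the last column and last row of the table.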
Features used by our system. Most are based on Tokenization2 except when specified.
| Features | Examples |
|---|---|
| Base form (stem) of the head token. | |
| Base form of the head token without '-' or '/' before or after. | |
| Sub-string after '-' in the head token. | |
| POS of the head token. | |
| First token of the entity is after '-' or '/'. | |
| Last token of the entity is before '-' or '/'. | |
| Head token has a special prefix: "over", "up", "down", "co" | |
| Concat. of base form and POS of parents of the head token in dependency parse. | |
| Concat. of base form and POS of children of the head token in dependency parse. | |
| Base forms of k neighboring tokens around the entity. | Base forms from the 2nd previous token to the 2nd next token are |
| POS of k neighboring tokens around the entity. | POS from the 2nd previous token to the 2nd next token are |
| Neighborhood of the entity has '-' or '/'. | Features from the 2nd previous token to the 2nd next token are |
| Sentence has "mRNA". | |
| Entity is connected with another string using Tokenization1. | |
| Argument is a protein. | |
| POS of the head token. | |
| Features extracted from IntAct when the argument is a protein. | |
| Base forms of k neighboring tokens around the argument. | Base forms from the 1st previous token to the 1st next token are |
| POS of k neighboring tokens around the argument. | POS from the 1st previous token to the 1st next token are |
| Concat. of base form and POS of parents of the head token in dependency parse. | NSUBJ←requir/VBZ for "regulation" in Figure 3 |
| Token sequence between candidate and argument has proteins. | |
| V-walk features between candidate and argument with base forms. | regul |
| E-walk features between candidate and argument with base forms. | |
| V-walk features between candidate and argument with POS. | NN, |
| E-walk features between candidate and argument with POS. | |
| Candidate and the argument share a token using Tokenization1. | ARG-express |
Most are based on Tokenization2 except when specified.
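A few of the surface features in the table can be sketched in plain Python. This is an illustration only, not the authors' implementation: the function name, the dictionary keys, the choice of the last '-' as split point, and the rule that a neighboring token "has" a separator when it contains '-' or '/' are all assumptions. Stemming, POS tags, and dependency features are left out because they need external tools.

```python
SPECIAL_PREFIXES = ("over", "up", "down", "co")  # prefixes listed in the table


def surface_features(tokens, head_index, window=2):
    """Return a few surface features for the head token at tokens[head_index].

    Hypothetical helper: names and exact matching rules are assumptions.
    """
    head = tokens[head_index]
    lowered = head.lower()

    # Tokens within `window` positions on each side of the head token.
    start = max(0, head_index - window)
    neighbours = tokens[start:head_index] + tokens[head_index + 1:head_index + 1 + window]

    return {
        # First listed prefix that the head token starts with, if any.
        "special_prefix": next((p for p in SPECIAL_PREFIXES if lowered.startswith(p)), None),
        # Sub-string after the last '-' (assumption: last rather than first).
        "after_hyphen": head.rpartition("-")[2] if "-" in head else None,
        # True when any token of the sentence contains "mRNA".
        "sentence_has_mrna": any("mRNA" in tok for tok in tokens),
        # True when a neighboring token contains '-' or '/'.
        "neighbourhood_has_separator": any("-" in t or "/" in t for t in neighbours),
    }


# Made-up sentence for illustration.
example_tokens = ["IL-4", "mRNA", "over-expression", "was", "observed"]
feats = surface_features(example_tokens, head_index=2)
print(feats)
```

The prefix test is a literal string match, so a token such as "control" would also match "co"; a real system would probably need a stricter rule.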
Figure 4. E-walks and V-walks. Examples of encodings of the dependency parse tree.
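The walk encodings can be sketched as follows, assuming the definition commonly used with walk-based dependency kernels: a V-walk is a (token, dependency label, token) triple and an E-walk is a (dependency label, token, dependency label) triple, both taken along the dependency path between candidate and argument. This excerpt does not spell out the definition, so treat it as an assumption. The function name and the example path are made up for illustration.

```python
def walks(path_tokens, path_edges):
    """Enumerate V-walks and E-walks along a dependency path.

    path_tokens: token descriptors t0..tn (base forms or POS tags).
    path_edges:  dependency labels e0..e(n-1); e_i links t_i and t_(i+1).
    """
    if len(path_edges) != len(path_tokens) - 1:
        raise ValueError("a path with n+1 tokens needs exactly n edges")

    # V-walk: two consecutive tokens and the edge between them.
    v_walks = [
        (path_tokens[i], path_edges[i], path_tokens[i + 1])
        for i in range(len(path_edges))
    ]
    # E-walk: two consecutive edges and the token they share.
    e_walks = [
        (path_edges[i - 1], path_tokens[i], path_edges[i])
        for i in range(1, len(path_edges))
    ]
    return v_walks, e_walks


# Hypothetical path from a trigger candidate to its argument.
v_walks, e_walks = walks(["regul", "requir", "protein"], ["nsubj", "dobj"])
print(v_walks)
print(e_walks)
```

Passing POS tags instead of base forms as `path_tokens` gives the POS variants of the same features.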
F-scores on the test set of the BioNLP 2013 GE task.
| TEES 2.1 | EVEX | Pipeline counterpart | RUPEE |
|---|---|---|---|---|
| Gene expression | 82.7 | 82.7 | 83.9 | |
| Transcription | 55.0 | 55.0 | 61.7 | |
| Protein catabol | 56.3 | 56.3 | 66.7 | |
| Phosphorylation | 72.6 | 71.5 | ||
| Localization | 60.7 | 56.9 | 57.7 | |
| SVT TOTAL | 74.9 | 74.5 | 79.0 | |
| BIN TOTAL | 42.9 | 41.6 | 42.4 | |
| Regulation | 23.0 | 23.4 | 23.1 | |
| Positive regul | 38.7 | 39.2 | 36.5 | |
| Negative regul | 43.7 | 38.1 | 43.6 | |
| REG TOTAL | 38.1 | 38.4 | 35.1 | |
| ALL TOTAL | 50.7 | 51.0 | 50.8 | |
F-scores on the test set of the BioNLP 2011 GE task.
| FAUST | UCLEED | SEARN | TEES | Pipeline counterpart | RUPEE |
|---|---|---|---|---|---|---|
| SVT | 73.9 | 73.5 | 71.8 | 72.1 | 71.8 | |
| BIN | 48.5 | 48.8 | 45.8 | 43.4 | 40.0 | |
| REG | 44.9 | 43.8 | 43.0 | 42.7 | 35.7 | |
| ALL | 55.2 | 53.5 | 53.3 | 50.0 | 55.6 | |
Figure 5. Precision-recall curves of (trigger, theme) pair classification, with level curves of the F-score in the background, computed on the BioNLP 2013 development set.
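The level curves of the F-score in the precision-recall plane follow directly from its definition. The derivation below assumes the balanced F-score (harmonic mean of precision $P$ and recall $R$), which is the standard measure in the BioNLP shared tasks; the symbols are introduced here for illustration.

```latex
F = \frac{2PR}{P + R}
\quad\Longrightarrow\quad
F\,(P + R) = 2PR
\quad\Longrightarrow\quad
R = \frac{F\,P}{2P - F}, \qquad P > \tfrac{F}{2}.
```

Each level curve is therefore a hyperbola branch that passes through the point $P = R = F$ and reaches $R = F/(2 - F)$ at $P = 1$.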