Martin Gerner, Farzaneh Sarafraz, Casey M Bergman, Goran Nenadic.
Abstract
MOTIVATION: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extracting information about biological events focus on limited-scale studies and/or abstracts, and the extracted data lack context and are rarely made available to support further research.
Year: 2012 PMID: 22711795 PMCID: PMC3413385 DOI: 10.1093/bioinformatics/bts332
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. System architecture. Each box represents the application of one tool (see main text for description), and each arrow represents the transfer of data from one tool to another. Circles represent data merging and post-processing. Entity recognition is performed by GeneTUKit (genes), GNAT (genes), LINNAEUS (species) and GETM's anatomical NER component. Parsing is performed by the McClosky–Charniak, Enju and Genia dependency parsers. Event extraction is performed by TEES and EventMine. In the final contextualization step, Negmole detects whether events are negated and/or speculative, and events are associated with anatomical entity mentions. Additional document parsing functions provide input to the system, and database storage functions handle outputs (neither is shown here).
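The merging step drawn as circles in the figure can be illustrated with plain set operations. The following is a minimal sketch, not the authors' implementation: the `Mention` record, its fields, and the example values are hypothetical, and it assumes that two tools agree on a mention only when document, character span, and normalized identifier all match.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    """Hypothetical record for one recognized entity mention."""
    doc_id: str
    start: int
    end: int
    entity_id: str


def merge_mentions(a: set[Mention], b: set[Mention]) -> dict[str, set[Mention]]:
    """Return the intersection and union of two tools' mention sets."""
    return {"intersection": a & b, "union": a | b}


# Made-up outputs from two taggers, for illustration only.
tool_a = {Mention("doc1", 0, 4, "G1"), Mention("doc1", 10, 15, "G2")}
tool_b = {Mention("doc1", 0, 4, "G1"), Mention("doc2", 3, 8, "G3")}

merged = merge_mentions(tool_a, tool_b)
print(len(merged["intersection"]), len(merged["union"]))
```

The intersection keeps only mentions that both tools report, while the union keeps every mention reported by either tool.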
The total number of gene mentions and the number of normalized, distinct genes recognized by GNAT and GeneTUKit in MEDLINE and PMC
| Source | Entity mentions (MEDLINE) | Entity mentions (PMC) | Entity mentions (MEDLINE + PMC) | Distinct entities (MEDLINE) | Distinct entities (PMC) | Distinct entities (MEDLINE + PMC) |
|---|---|---|---|---|---|---|
| GNAT | 35 910 779 | 12 729 471 | 48 050 830 | 227 809 | 129 244 | 253 929 |
| GeneTUKit | 47 989 353 | 19 217 778 | 66 431 789 | 258 765 | 143 706 | 287 218 |
| Intersection | 26 281 266 | 8 638 823 | 34 479 547 | 224 604 | 125 763 | 249 932 |
| Union | 57 618 866 | 23 308 426 | 80 003 072 | 261 412 | 146 552 | 290 557 |
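The entity mention counts in this table are internally consistent under inclusion-exclusion: for each corpus, the union equals the GNAT count plus the GeneTUKit count minus the intersection. The short check below is a sketch of that arithmetic only; the variable names are illustrative, and it covers the mention columns, not the distinct-entity columns.

```python
# Entity mention counts copied from the table above,
# in the order (MEDLINE, PMC, MEDLINE + PMC).
gnat = (35_910_779, 12_729_471, 48_050_830)
genetukit = (47_989_353, 19_217_778, 66_431_789)
intersection = (26_281_266, 8_638_823, 34_479_547)
union = (57_618_866, 23_308_426, 80_003_072)


def union_size(a: int, b: int, both: int) -> int:
    """Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|."""
    return a + b - both


# Compare the computed union with the reported union for each corpus.
checks = [
    union_size(g, t, i) == u
    for g, t, i, u in zip(gnat, genetukit, intersection, union)
]
print(checks)
```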
The total number of event mentions and the number of distinct events extracted by TEES and EventMine from MEDLINE and PMC
| Source | Event mentions (MEDLINE) | Event mentions (PMC) | Event mentions (MEDLINE + PMC) | Distinct events (MEDLINE) | Distinct events (PMC) | Distinct events (MEDLINE + PMC) |
|---|---|---|---|---|---|---|
| TEES | 19 406 453 | 4 719 648 | 23 856 554 | 6 570 824 | 1 804 846 | 7 797 604 |
| EventMine | 18 988 271 | 4 010 945 | 22 737 258 | 6 502 371 | 1 588 178 | 7 539 364 |
| Intersection | 9 243 903 | 1 331 456 | 10 455 678 | 3 080 900 | 573 903 | 3 424 372 |
| Union | 29 150 821 | 7 399 137 | 36 138 134 | 9 635 566 | 2 676 257 | 11 442 462 |
Gene/protein NER evaluation results
| Source | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|
| GNAT | 79.8 | 83.7 | 81.7 |
| GeneTUKit | 72.2 | 79.1 | 75.5 |
| Intersection | 82.8 | 70.4 | 76.1 |
| Union | 71.4 | 92.0 | 80.4 |
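The F-scores in this table can be recomputed from the precision and recall columns. The sketch below assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall); the record does not state this explicitly, but the reported values agree with that formula to within rounding. The function and variable names are illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall (same units as inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F-score %) copied from the table above.
rows = {
    "GNAT": (79.8, 83.7, 81.7),
    "GeneTUKit": (72.2, 79.1, 75.5),
    "Intersection": (82.8, 70.4, 76.1),
    "Union": (71.4, 92.0, 80.4),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f_score(p, r):.1f}, reported {reported}")
```

The table shows the usual trade-off: taking the intersection of the two taggers raises precision at the cost of recall, and taking the union does the reverse.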
Event extraction evaluation results on the B+G corpus
| Source | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|
| TEES | 50.4 | 53.6 | 51.9 |
| EventMine | 45.7 | 45.5 | 45.6 |
| Intersection | 66.2 | 36.6 | 47.1 |
| Union | 41.3 | 62.0 | 49.6 |
Event extraction results on the B+G corpus, including negation/speculation detection as processed by Negmole
| Source | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|
| Intersection | 62.6 | 34.6 | 44.6 |
| Union | 38.8 | 58.3 | 46.6 |