Jari Björne, Filip Ginter, Sampo Pyysalo, Jun'ichi Tsujii, Tapio Salakoski.
Abstract
MOTIVATION: There has recently been a notable shift in biomedical information extraction (IE) from relation models toward the more expressive event model, facilitated by the maturation of basic tools for biomedical text analysis and the availability of manually annotated resources. The event model allows detailed representation of complex natural language statements and can support a number of advanced text mining applications ranging from semantic search to pathway extraction. A recent collaborative evaluation demonstrated the potential of event extraction systems, yet there have so far been no studies of the generalization ability of the systems nor the feasibility of large-scale extraction.
Year: 2010 PMID: 20529932 PMCID: PMC2881365 DOI: 10.1093/bioinformatics/btq180
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Targeted event types with brief example statements expressing an event of each type
| Event type | Example |
|---|---|
| Gene expression | |
| Transcription | promoter associated with |
| Localization | phosphorylation and nuclear |
| Protein catabolism | |
| Phosphorylation | |
| Binding | |
| Regulation | |
| Positive regulation | |
| Negative regulation | |
In the examples, the word or words marked as triggering the presence of the event are shown in italics and event participants underlined. The event types are grouped by event participants, with the first five types taking one theme, binding events multiple themes and the regulation types theme and cause participants.
Fig. 1. Event extraction. A multiphased system is used to generate an event graph, a formal representation for the semantic content of the sentence. Before event detection, sentences are parsed (A) to generate a suitable syntactic graph to be used in detecting semantic relationships. Event detection starts with identification of named entities (B) with BANNER (parses are not used at this step). Once named entities have been identified, the trigger detector (C) uses them and the parse for predicting triggers, words which define potential events. The edge detector (D) predicts relationship edges (event arguments) between triggers and named entities. Finally, the resulting semantic graph is divided into individual events by (E) duplicating trigger nodes and regrouping argument edges.
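The final step (E) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Edge` class, the `split_into_events` function and the grouping rules are hypothetical simplifications based only on the participant structure described for the nine event types (one theme for the first five types, multiple themes for binding, theme and cause for the regulation types).

```python
from dataclasses import dataclass
from itertools import product

# Event types grouped by the participants they take.
SINGLE_THEME = {"Gene expression", "Transcription", "Localization",
                "Protein catabolism", "Phosphorylation"}
REGULATION = {"Regulation", "Positive regulation", "Negative regulation"}


@dataclass(frozen=True)
class Edge:
    trigger: str  # id of the trigger word the argument belongs to
    role: str     # "Theme" or "Cause"
    target: str   # id of a named entity or of another trigger


def split_into_events(trigger_types, edges):
    """Regroup the argument edges of a merged graph into individual events.

    Assumed rules (illustrative only): one event per theme for single-theme
    types, one event holding all themes for Binding, and one event per
    (theme, cause) pair for the regulation types.
    """
    events = []
    for trig, etype in trigger_types.items():
        themes = [e.target for e in edges
                  if e.trigger == trig and e.role == "Theme"]
        causes = [e.target for e in edges
                  if e.trigger == trig and e.role == "Cause"]
        if etype in SINGLE_THEME:
            # "Duplicating" the trigger: each theme gets its own event.
            events += [(etype, trig, (t,), None) for t in themes]
        elif etype == "Binding" and themes:
            events.append((etype, trig, tuple(themes), None))
        elif etype in REGULATION:
            # A missing cause is represented by None.
            for t, c in product(themes, causes or [None]):
                events.append((etype, trig, (t,), c))
    return events


# Made-up graph: one phosphorylation trigger with two themes, regulated by P3.
triggers = {"T1": "Phosphorylation", "T2": "Positive regulation"}
edges = [Edge("T1", "Theme", "P1"), Edge("T1", "Theme", "P2"),
         Edge("T2", "Theme", "T1"), Edge("T2", "Cause", "P3")]
for event in split_into_events(triggers, edges):
    print(event)
```

The sketch does not propagate duplication to nested events: once `T1` is split in two, a complete system would also need to decide which of the two phosphorylation events the regulation event refers to.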
NER system performance
| System | Corpus | F-score | References |
|---|---|---|---|
| JNLPBA (best) | GENIA term | 72.6% | Kim |
| BioCreative I (best) | GENETAG | 83.2% | Yeh |
| BioCreative II (best) | GENETAG | 87.2% | Smith |
| BANNER | GENETAG | 86.4% | Leaman and Gonzalez ( |
Performance shown for the best performing systems at various shared tasks and the BANNER system used in this work.
Note that while the GENIA term corpus used in the JNLPBA task requires differentiation between, e.g. protein and gene entities, GENETAG only marks a single gene/RNA/protein type, contributing to the measured differences.
Frequency of the nine event types in the output of the system on the PubMed sample
| Event Type | Count (%) |
|---|---|
| Gene expression | 48 144 (28.5) |
| Positive regulation | 43 155 (25.5) |
| Binding | 24 159 (14.3) |
| Negative regulation | 21 833 (12.9) |
| Regulation | 13 330 (7.9) |
| Localization | 10 766 (6.4) |
| Phosphorylation | 3 852 (2.3) |
| Transcription | 2 492 (1.5) |
| Protein catabolism | 1 218 (0.7) |
| TOTAL | 168 949 (100.0) |
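The percentages in the table follow directly from the counts. The short check below recomputes them; the variable names are illustrative and the counts are copied from the table.

```python
# Event counts from the table above.
counts = {
    "Gene expression": 48144,
    "Positive regulation": 43155,
    "Binding": 24159,
    "Negative regulation": 21833,
    "Regulation": 13330,
    "Localization": 10766,
    "Phosphorylation": 3852,
    "Transcription": 2492,
    "Protein catabolism": 1218,
}

total = sum(counts.values())
# Share of each event type, in percent, rounded to one decimal as in the table.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Combined share of the three regulation types.
regulation_share = round(
    100 * sum(n for name, n in counts.items() if "egulation" in name) / total, 1
)

for name, pct in shares.items():
    print(f"{name:20s} {counts[name]:7d} ({pct})")
print("TOTAL", total, "| regulation types combined:", regulation_share)
```

The three regulation types together account for a little under half of all extracted events.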
Fig. 2. Total number of citations and citations with tagged gene/protein mentions and events in the sample by year.
Fig. 3. Number of citations with tagged mentions of insulin, IgG and TNF-alpha (normalized for capitalization and hyphenation), as well as extracted events of these proteins. The counts are cumulative for every five years to smooth the curves.
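The normalization and five-year aggregation mentioned in the caption of Fig. 3 could be done roughly as follows. The exact rules used for the figure are not given, so this sketch rests on two assumptions: normalization means lowercasing and dropping hyphens and whitespace, and the five-year aggregation means summing counts within fixed five-year bins. All function names are hypothetical.

```python
import re
from collections import Counter


def normalize_mention(text):
    """Lowercase and drop hyphens/whitespace so spelling variants collapse."""
    return re.sub(r"[-\s]+", "", text.lower())


def five_year_bin(year):
    """Map a year to the first year of its five-year bin, e.g. 1987 -> 1985."""
    return year - year % 5


def binned_counts(mentions, target):
    """Count (year, mention) pairs matching `target`, per five-year bin."""
    key = normalize_mention(target)
    return Counter(
        five_year_bin(year)
        for year, mention in mentions
        if normalize_mention(mention) == key
    )


# Made-up input: (publication year, tagged mention text).
sample = [(1986, "TNF-alpha"), (1989, "Tnf alpha"), (1991, "insulin")]
print(binned_counts(sample, "TNF-alpha"))
```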
Fig. 4. Extracted event network around interleukin-4. This graph shows a subset of the predicted event network, including only named entities with at least 50 extracted instances. The round event nodes are (P)ositive regulation, (N)egative regulation, (R)egulation, gene (E)xpression, (B)inding, p(H)osphorylation and (L)ocalization. For clarity, single-argument events (E, B, H and L) are displayed only when they also act as arguments of regulation events.
Top related MeSH descriptors for the nine event types
| Event type | Five most related MeSH descriptors |
|---|---|
| Gene expression | Gene expression regulation; RNA; gene expression; cytokines; immunohistochemistry |
| Positive regulation | Intracellular signaling peptides and proteins; phosphotransferases; transcription factors; cytokines; gene expression regulation |
| Negative regulation | Molecular mechanisms of pharmacological action; intracellular signaling peptides and proteins; therapeutic uses; phosphotransferases; tumor cells, cultured |
| Binding | Protein binding; information services; physicochemical phenomena; chemistry techniques, analytical; receptors, cell surface |
| Regulation | Gene expression regulation; RNA, messenger; transcription factors; protein kinases; peptide hormones |
| Localization | Endocrine system; protein precursors; nerve tissue proteins; hormones, hormone substitutes and hormone antagonists; organelles |
| Transcription | RNA; gene components; gene expression; base sequence; transcription factors |
| Phosphorylation | Organic chemistry phenomena; tyrosine; adaptor proteins, signal transducing; phosphotransferases (alcohol group acceptor); phosphoproteins |
| Protein catabolism | Hydrolases; macromolecular substances; technology, industry and agriculture; physicochemical processes; metabolism |
Comparison of named entity and event detection precision between MeSH-relevant and MeSH-irrelevant citations
| | Citation | TP | FP | Precision (%) |
|---|---|---|---|---|
| Named entities | MeSH-relevant | 66 | 5 | 93.0 |
| Named entities | MeSH-irrelevant | 21 | 8 | 72.4 |
| Events | MeSH-relevant | 58 | 29 | 66.7 |
| Events | MeSH-irrelevant | 4 | 9 | 30.8 |
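Precision here is the share of true positives among all predictions, TP / (TP + FP). The following check reproduces the precision column from the TP and FP counts in the table; the names used are illustrative.

```python
def precision(tp, fp):
    """Precision in percent: true positives over all predicted positives."""
    return 100 * tp / (tp + fp)


# (target, citation group) -> (TP, FP), copied from the table above.
evaluated = {
    ("Named entities", "MeSH-relevant"): (66, 5),
    ("Named entities", "MeSH-irrelevant"): (21, 8),
    ("Events", "MeSH-relevant"): (58, 29),
    ("Events", "MeSH-irrelevant"): (4, 9),
}

for (target, group), (tp, fp) in evaluated.items():
    print(f"{target:15s} {group:16s} {precision(tp, fp):.1f}")
```

Note that the MeSH-irrelevant event figure rests on only 13 evaluated predictions, so it is a much coarser estimate than the others.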
Processing requirements for different components as measured for the sample and estimated for the whole PubMed
| Component | Sample time (h) | Sample space (MB) | PubMed time (days) | PubMed space (GB) |
|---|---|---|---|---|
| NER (BANNER) | 18 | 272 | 75 | 27 |
| Parsing | 53 | 830 | 222 | 81 |
| Event extraction | 27 | 276 | 114 | 27 |
| TOTAL | 98 | 1378 | 411 | 135 |
Space requirements are stated for uncompressed XML files.
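The relationship between the sample measurements and the whole-PubMed estimates can be made explicit by computing the implied scaling factor for each component. The figures are taken from the table; the `wall_clock_days` helper is a hypothetical addition that assumes citations can be processed independently with perfect parallel speedup, which the source does not state.

```python
# Measured on the sample (hours) and estimated for all of PubMed (days).
sample_hours = {"NER (BANNER)": 18, "Parsing": 53, "Event extraction": 27}
pubmed_days = {"NER (BANNER)": 75, "Parsing": 222, "Event extraction": 114}

# Implied scaling factor from sample to whole PubMed, per component.
scale = {
    name: pubmed_days[name] * 24 / sample_hours[name] for name in sample_hours
}


def wall_clock_days(total_days, workers):
    """Idealized wall-clock time, assuming perfect parallel speedup."""
    return total_days / workers


total_days = sum(pubmed_days.values())
for name, factor in scale.items():
    print(f"{name:17s} x{factor:.1f}")
print("Total sequential days:", total_days)
print("Days on 100 workers (idealized):", wall_clock_days(total_days, 100))
```

Each component scales by a factor of roughly 100, which suggests the sample corresponds to about 1% of PubMed.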