Abstract
BACKGROUND: The Turku Event Extraction System (TEES) is a text mining program developed for the extraction of events, complex biomedical relationships, from scientific literature. Based on a graph-generation approach, the system detects events using a rich feature set built via dependency parsing. TEES has achieved record performance in several of the shared tasks of its domain, and continues to be used in a variety of biomedical text mining tasks.
Year: 2015 PMID: 26551925 PMCID: PMC4642046 DOI: 10.1186/1471-2105-16-S16-S4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The TEES event extraction process. Preprocessing steps A-C can be omitted in the BioNLP Shared Tasks because the organizers provide the corresponding data. The event extraction steps D-F are all independent SVM classification steps, with the trigger and edge detection steps being linked together by the recall adjustment parameter. (Figure adapted from Björne et al. [5].)
The BioNLP'13 GE task annotation scheme, automatically learned from the corresponding corpus.
| Type | Name | Arguments |
|---|---|---|
| ENTITY | Anaphora | |
| ENTITY | Entity | |
| ENTITY | Protein | |
| EVENT | Acetylation [2,2] | Site {Theme} [1,1] Entity / Theme [1,1] Protein |
| EVENT | Binding [1,4] | Site {Theme} [0,2] Entity / Theme [1,2] Protein |
| EVENT | Deacetylation [2,2] | Cause [0,1] Protein / Site {Theme} [0,1] Entity / Theme [1,1] Protein |
| EVENT | Gene expression [1,1] | Theme [1,1] Protein |
| EVENT | Localization [1,2] | Theme [1,1] Protein / ToLoc [0,1] Entity |
| EVENT | Negative regulation [1,3] | Cause [0,1] Acetylation, Binding, Gene expression, Negative regulation, Phosphorylation, Positive regulation, Protein, Protein catabolism, Regulation, Ubiquitination / Site {Cause,Theme} [0,1] Entity / Theme [1,1] Binding, Gene expression, Localization, Negative regulation, Phosphorylation, Positive regulation, Protein, Protein catabolism, Regulation, Transcription, Ubiquitination |
| EVENT | Phosphorylation [1,3] | Cause [0,1] Protein / Site {Theme} [0,1] Entity / Theme [1,1] Protein |
| EVENT | Positive regulation [1,3] | Cause [0,1] Acetylation, Binding, Gene expression, Negative regulation, Phosphorylation, Positive regulation, Protein, Protein catabolism, Regulation, Ubiquitination / Site {Cause,Theme} [0,1] Entity / Theme [1,1] Binding, Deacetylation, Gene expression, Localization, Negative regulation, Phosphorylation, Positive regulation, Protein, Protein catabolism, Protein modification, Regulation, Transcription, Ubiquitination |
| EVENT | Protein catabolism [1,1] | Theme [1,1] Protein |
| EVENT | Protein modification [1,1] | Theme [1,1] Protein |
| EVENT | Regulation [1,3] | Cause [0,1] Binding, Gene expression, Localization, Negative regulation, Phosphorylation, Positive regulation, Protein, Protein modification, Regulation / Site {Cause,Theme} [0,1] Entity / Theme [1,1] Binding, Gene expression, Localization, Negative regulation, Phosphorylation, Positive regulation, Protein, Protein catabolism, Protein modification, Regulation, Transcription |
| EVENT | Transcription [1,1] | Theme [1,1] Protein |
| EVENT | Ubiquitination [1,2] | Cause [0,1] Protein / Site {Theme} [0,1] Entity / Theme [1,1] Protein |
| RELATION | Coreference, directed | Subject(Anaphora) / Object(Anaphora, Entity, Protein) |
| RELATION | SiteParent, directed | Arg1(Entity) / Arg2(Protein) |
| MODIFIER | negation | Binding, Gene expression, Localization, Negative regulation, Phosphorylation, Positive regulation, Protein catabolism, Regulation, Transcription |
| MODIFIER | speculation | Binding, Gene expression, Localization, Negative regulation, Phosphorylation, Positive regulation, Protein catabolism, Regulation, |
| TARGET | ENTITY | Acetylation, Anaphora, Binding, Deacetylation, Entity, Gene expression, Localization, Negative regulation, Phosphorylation, Positive regulation, Protein catabolism, Protein modification, Regulation, Transcription, Ubiquitination |
| TARGET | INTERACTION | Cause, Coreference, Site, SiteParent, Theme, ToLoc |
The entities form the nodes of the graph. The events and relations connect the nodes together and are defined by a type and a specific, limited set of arguments. Both event and relation arguments have specified valid target node types. For site-arguments, the primary argument types are shown in curly brackets. Each event argument also has a minimum and maximum number of occurrences allowed per event, and the event itself has a minimum and maximum number of arguments. The relations must always have two arguments and can optionally be directed. Modifiers are binary attributes that can be applied to a limited set of node types. The targets define the types of nodes and edges to be automatically extracted.
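The constraints described above can be illustrated with a minimal sketch. It encodes only the Phosphorylation row of the table and checks a candidate event against the per-argument and per-event limits. The class and function names (`ArgumentRule`, `EventRule`, `is_valid_event`) are hypothetical and are not part of TEES; the handling of site-argument primary types is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ArgumentRule:
    """Limits for one argument type (hypothetical helper, not a TEES class)."""
    min_count: int
    max_count: int
    targets: frozenset  # valid target node types


@dataclass
class EventRule:
    """Limits for one event type (hypothetical helper, not a TEES class)."""
    min_args: int
    max_args: int
    arguments: dict  # argument name -> ArgumentRule


# Transcribed from the table row:
# Phosphorylation [1,3] | Cause [0,1] Protein / Site {Theme} [0,1] Entity / Theme [1,1] Protein
PHOSPHORYLATION = EventRule(
    min_args=1,
    max_args=3,
    arguments={
        "Cause": ArgumentRule(0, 1, frozenset({"Protein"})),
        "Site": ArgumentRule(0, 1, frozenset({"Entity"})),
        "Theme": ArgumentRule(1, 1, frozenset({"Protein"})),
    },
)


def is_valid_event(rule, args):
    """Check a list of (argument name, target node type) pairs against a rule."""
    # The event as a whole has a minimum and maximum number of arguments.
    if not rule.min_args <= len(args) <= rule.max_args:
        return False
    # Every argument must be a known type pointing at a valid target node type.
    for name, target in args:
        arg_rule = rule.arguments.get(name)
        if arg_rule is None or target not in arg_rule.targets:
            return False
    # Each argument type has its own minimum and maximum count.
    counts = Counter(name for name, _ in args)
    return all(
        r.min_count <= counts[name] <= r.max_count
        for name, r in rule.arguments.items()
    )
```

For example, `is_valid_event(PHOSPHORYLATION, [("Theme", "Protein"), ("Site", "Entity")])` is accepted, while an event with only a Cause argument is rejected because the mandatory Theme is missing.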
Figure 2. Multiple approaches (A and B) were used in the BioNLP'11 Shared Task for representing site-arguments in the TEES graph format. In TEES 2.0 these representations have been merged into the unified representation (C), allowing site-arguments to be processed like any other event arguments.
Figure 3. The visualizer provided with TEES 2.2 can be used to display both the event annotation and the parse of a sentence. This figure shows sentence GE13.d216.s0, taken from the BioNLP 2013 GENIA development corpus document PMC-3333881-20-Caption-Figure 3, demonstrating a nested event structure consisting of two Negative regulation events.
Official BioNLP 2013 Shared Task results for the TEES system showing performance on the hidden test sets.
| Task | # | R | P | F | SER |
|---|---|---|---|---|---|
| GE13 | 2/10 | 46.17 | 56.32 | 50.74 | |
| CG13 | 1/6 | 48.76 | 64.17 | 55.41 | |
| PC13 | 2/2 | 47.15 | 55.78 | 51.10 | |
| GRO13 | 1/1 | 15.22 | 36.58 | 21.50 | |
| GRN13 | 3/5 | 33 | 78 | 46 | 0.86 |
| BB13 T1 | 0/4 | | | | |
| BB13 T2 | 1/4 | 28 | 82 | 42 | |
| BB13 T3 | 1/2 | 12 | 18 | 14 | |
The performance is defined by the (F)-score, composed of (R)ecall and (P)recision. The SER metric is used for the GRN task. The BB task 1 is outside the scope of the TEES system. Placement among other systems is indicated by #.
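As a quick check on how the F column relates to R and P, the sketch below assumes the standard balanced F-score (the harmonic mean of recall and precision). The source does not spell out the formula, but this definition reproduces the GE13 and CG13 rows of the table. The function name `f_score` is illustrative.

```python
def f_score(recall, precision):
    """Balanced F-score: harmonic mean of recall and precision (assumed definition)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Recall and precision taken from the GE13 and CG13 rows of the table.
print(round(f_score(46.17, 56.32), 2))  # 50.74
print(round(f_score(48.76, 64.17), 2))  # 55.41
```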
Figure 4. The performance of systems that took part in the BioNLP'13 Shared Task. The TEES results are shown with black crosses. Note that in tasks GRN and BB T1 the metric is SER*100, where a smaller score is better.
Turku Event Extraction System in the BioNLP Shared Tasks.
| Task | Name | devel / test | devel 2.2 / test 2.2 |
|---|---|---|---|
| GE09 1 | GENIA Event Extraction | - / 51.95 | 49.11 / -a |
| GE09 2 | Protein Site Arguments | - / - | - / -a |
| GE09 3 | Negation & Speculation | - / - | - / -a |
| GE11 1 | GENIA Event Extraction | 55.78 / 53.30 | 53.91 / 54.03 |
| GE11 2 | Protein Site Arguments | 53.39 / 51.97 | -b/-b |
| GE11 3 | Negation & Speculation | 38.34 / 26.86 | 37.92 / 31.85 |
| EPI11 | Epigenetics and PTMs | 56.41 / 53.33 | 60.03 / 56.22 |
| ID11 | Infectious Diseases | 44.92 / 42.57 | 50.56 / 49.96 |
| BB11 | Bacteria Biotopes | 27.01 / 26 | 30.87 / -a |
| BI11 | Bacteria Gene Interactions | 77.24 / 77 | 76.81 / -a |
| CO11 | Protein/Gene Coreference | 36.22 / 23.77 | 30.11 / -a |
| REL11 | Entity Relations | 65.99 / 57.7 | -/-a |
| REN11 | Bacteria Gene Renaming | 84.62 / 87.0 | 85.04 / -a |
| GE13 | GENIA Event Extraction | 51.43* / 50.74 | 50.13* / 49.18 |
| CG13 | Cancer Genetics | 61.82* / 55.41 | 63.50* / 54.99 |
| PC13 | Pathway Curation | 57.63* / 51.10 | 59.74* / 49.90 |
| GRO13 | Gene Regulation Ontology | 47.18* / 21.50 | 47.42* / -a |
| GRN13 | Gene Regulation Network | -c / 0.86 SER | -c / 0.85 SER |
| BB13 1 | NER and Categorization | -/- | -/- |
| BB13 2 | Bacteria Localization | 11.81* / 42 | 13.71* / 42.20 |
| BB13 3 | Bacteria Entities & Relations | 64.67* / 14 | 63.34* / 14.24 |
The first results column shows the official competition results from each BioNLP Shared Task in which TEES participated, reflecting the system's performance at that point in its development. The second column shows the performance of the current TEES 2.2 system. The development set results are evaluated using an official downloadable evaluator or the TEES internal evaluator (the latter shown with a star). The test set results are the results from the shared task (first column) or evaluated with the official online evaluation service (second column). The superscript (a) indicates that the online evaluation service is no longer available, (b) that the competition metric is not provided by the online evaluation service, and (c) that the downloadable evaluator failed to process the predicted events.
System component performance in F-score evaluated by replacing individual processing steps with an "All Correct" classifier that always returns the correct result.
| All Correct | Simple | Binding | Regulation | All |
|---|---|---|---|---|
| None | 76.87 | 50.44 | 42.87 | 55.99 |
| Entity | 93.17 | 65.72 | 65.37 | 75.32 |
| Edge | 89.70 | 71.63 | 72.46 | 78.62 |
| Unmerging | 86.53 | 77.30 | 61.07 | 72.76 |
| Entity + Edge | 98.40 | 77.99 | 93.38 | 93.32 |
| All | 98.40 | 94.63 | 94.75 | 95.99 |
The left column shows the step or steps replaced. The other columns show performance for GE11 event subsets, where Simple refers to the five single-argument event types and Regulation to the three regulation types. Performance is evaluated for task 1 on the development set with the downloadable GE11 evaluator.
GE11 event extraction with scikit-learn classifiers.
| Classifier | Parameters | Recall | Precision | F-score |
|---|---|---|---|---|
| BernoulliNB | alpha = 0.001,0.01,0.1,1,10,100,1000 | 53.41 | 14.93 | 23.34 |
| Perceptron | default | 38.82 | 61.73 | 47.67 |
| SVC | C=[10^-3, 10^6], probability=True | 47.06 | 66.05 | 54.96 |
| LinearSVC | C=[10^-3, 10^6] | 46.65 | 68.02 | 55.35 |
| ExtraTrees | n_estimators = 10,50,100 | 27.97 | 78.58 | 41.25 |
| RandomForest | n_estimators = 10,50,100,500 | 24.92 | 78.65 | 37.84 |
| SVM | C=[1, 10^6] | 54.98 | 52.89 | 53.92 |
The SVM used in all TEES BioNLP Shared Task results is shown for reference. Performance is evaluated for task 1 on the development set with the downloadable GE11 evaluator.
Figure 5. Examples for the feature groups in Figure 6 and Table 6. The numbered dependencies and tokens indicate the linear and dependency context for the token "phosphorylation". The dotted Theme edge and its corresponding dependency indicate the shortest path of an event argument edge. The example features correspond to the "phosphorylation" entity and the dotted edge. The token features TOK(x) are incorporated into the more complex features. (Figure adapted from [13].)
Figure 6. The distribution of feature importances for feature groups, for each of the four classification steps. The deps group refers to dependencies. In the box plots the boxes contain the features from the lower to upper quartiles, with a red line at the median. The dotted-line whiskers extend to 1.5 times the interquartile range and the outlier points are shown as individual markers. See Figure 5 for feature group details.
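The box plot conventions described in the Figure 6 caption can be made concrete with a small sketch. It computes the quartiles, the 1.5 times interquartile range fences, the whisker ends (taken here as the most extreme data points inside the fences, a common plotting convention that is assumed rather than stated in the source) and the outliers. The function name `box_plot_stats` and the input values are illustrative only and are not feature importances from the paper; the quartile interpolation method is also an assumption.

```python
import statistics


def box_plot_stats(values):
    """Summarize values the way a box plot with 1.5 * IQR whiskers would."""
    # Quartiles with linear interpolation between data points (assumed method).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points still inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    # Points beyond the fences are drawn as individual outlier markers.
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Toy values chosen only to show one point falling outside the upper whisker.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(stats["q1"], stats["median"], stats["q3"], stats["outliers"])
```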
Most important features as determined by the scikit-learn ExtraTreesClassifier.
| step | # | weight | feature | group |
|---|---|---|---|---|
| trigger | 1 | 0.0087 | POS VB | token |
| | 2 | 0.0078 | linear_3_txt_I | linear |
| | 3 | 0.0066 | stem_induct | subtoken |
| | 4 | 0.006 | dt_on | subtoken |
| | 5 | 0.0054 | linear_3_txt_we | linear |
| | 6 | 0.0054 | linear_3_txt_was | linear |
| | 7 | 0.0041 | linear_-1_txt_inhibits | linear |
| | 8 | 0.0041 | dt_si | subtoken |
| | 9 | 0.0038 | dist_3_annType_Protein | dependencies |
| | 10 | 0.0034 | dt_xp | subtoken |
| edge | 1 | 0.009 | e2_txt_Id1 | entity |
| | 2 | 0.0042 | tok_FFtxt_phosphorylation | path |
| | 3 | 0.0039 | dep_Reverse_dobj | path |
| | 4 | 0.0036 | tokenPath_Positive_regulation_e1_Positive_regulation_ | path |
| | 5 | 0.0035 | GENIA_target_protein | entity |
| | 6 | 0.0034 | POS_VBZ | path |
| | 7 | 0.0034 | tok_RFFFtxt_mRNA | path |
| | 8 | 0.0028 | tok_RFFtxt_phosphorylation | path |
| | 9 | 0.0025 | tok_RRtxt_Id2 | path |
| | 10 | 0.0025 | txt_block | path |
| unmerging | 1 | 0.0064 | argTheme_dep_Reverse_prep_of | args |
| | 2 | 0.0062 | argTheme_POS_NN | args |
| | 3 | 0.006 | argTheme_txt_expression | args |
| | 4 | 0.0048 | trg_dt_up | trigger |
| | 5 | 0.0047 | trg_chain_dist_dist_2-rev_appos-rev_punct | trigger |
| | 6 | 0.0045 | trg_dt_xp | trigger |
| | 7 | 0.0043 | trg_tt_ssi | trigger |
| | 8 | 0.0041 | argTheme_txt_affected | args |
| | 9 | 0.0041 | trg_dt_ex | trigger |
| | 10 | 0.0041 | argThemetrg_dep_dist_dist_3dep | args |
| modifier | 1 | 0.013 | t1HOut_neg_RB | dependencies |
| | 2 | 0.013 | t1HOut_neg | dependencies |
| | 3 | 0.011 | t1HOut_nsubjpass_NAMED_ENT | dependencies |
| | 4 | 0.0089 | dep_dist_dist_3neg | dependencies |
| | 5 | 0.0074 | t1HOut_not | dependencies |
| | 6 | 0.0072 | dist_3_txt_not | dependencies |
| | 7 | 0.0053 | dist_3_txt_significantly | dependencies |
| | 8 | 0.0048 | chain_dist_dist_1-rev_nsubjpass-frw_conj_and-rev_dep | dependencies |
| | 9 | 0.0044 | linear_-2_txt_was | linear |
| | 10 | 0.0032 | t1HOut_advmod | dependencies |
See Figure 5 for feature group details.
The weights are relative for each classification step.