Yang Lu1, Xiaolei Ma1, Yinan Lu2, Yuxin Zhou2, Zhili Pei3.
Abstract
Biomedical event extraction is an important and difficult task in bioinformatics. With the rapid growth of biomedical literature, the extraction of complex events from unstructured text has attracted increasing attention. However, annotated biomedical corpora are highly imbalanced, which degrades the performance of classification algorithms. In this study, a sample selection algorithm based on sequential patterns is proposed to filter negative samples in the training phase. To exploit the joint information between the trigger and the arguments of multiargument events, we extract triplets of multiargument events directly using a support vector machine classifier. A joint scoring mechanism, based on sentence similarity and the importance of the trigger in the training data, is used to correct the predicted results. Experimental results indicate that the proposed method can extract events efficiently.
Year: 2016 PMID: 28096894 PMCID: PMC5206857 DOI: 10.1155/2016/7536494
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Class, event types, and their arguments for the GE task.
| Event class | Event type | Primary argument | Secondary argument |
|---|---|---|---|
| SVT | Gene_expression | Theme(P) | |
| | Transcription | Theme(P) | |
| | Localization | Theme(P) | AtLoc, ToLoc |
| | Protein_catabolism | Theme(P) | |
| | Phosphorylation | Theme(P) | Site |
| BIND | Binding | Theme(P)+ | Site+ |
| REG | Regulation | Theme(P/Ev), Cause(P/Ev) | Site, Csite |
| | Positive_regulation | Theme(P/Ev), Cause(P/Ev) | Site, Csite |
| | Negative_regulation | Theme(P/Ev), Cause(P/Ev) | Site, Csite |
P is protein; Ev is event.
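The event schema in the table above can be written down as a small lookup structure. The sketch below is only an illustration: the names `EventSchema`, `GE_SCHEMA`, `allowed_roles`, and `uses_only_allowed_roles` are hypothetical and do not come from the paper, and the `+` suffix is read here as "the role may occur more than once", which is an assumption about the table's notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventSchema:
    event_class: str          # SVT, BIND, or REG
    primary: tuple            # primary argument roles
    secondary: tuple          # secondary argument roles
    fillers: tuple            # allowed primary fillers: "P" (protein), "Ev" (event)
    repeatable: bool = False  # assumption: "+" in the table means repeatable roles


# Entries transcribed from the table; the encoding itself is illustrative.
GE_SCHEMA = {
    "Gene_expression": EventSchema("SVT", ("Theme",), (), ("P",)),
    "Transcription": EventSchema("SVT", ("Theme",), (), ("P",)),
    "Localization": EventSchema("SVT", ("Theme",), ("AtLoc", "ToLoc"), ("P",)),
    "Protein_catabolism": EventSchema("SVT", ("Theme",), (), ("P",)),
    "Phosphorylation": EventSchema("SVT", ("Theme",), ("Site",), ("P",)),
    "Binding": EventSchema("BIND", ("Theme",), ("Site",), ("P",), repeatable=True),
    "Regulation": EventSchema(
        "REG", ("Theme", "Cause"), ("Site", "Csite"), ("P", "Ev")),
    "Positive_regulation": EventSchema(
        "REG", ("Theme", "Cause"), ("Site", "Csite"), ("P", "Ev")),
    "Negative_regulation": EventSchema(
        "REG", ("Theme", "Cause"), ("Site", "Csite"), ("P", "Ev")),
}


def allowed_roles(event_type: str) -> set:
    """Return every argument role the table lists for an event type."""
    schema = GE_SCHEMA[event_type]
    return set(schema.primary) | set(schema.secondary)


def uses_only_allowed_roles(event_type: str, roles) -> bool:
    """Check that a candidate event uses no role outside the table."""
    return set(roles) <= allowed_roles(event_type)
```

For example, `uses_only_allowed_roles("Phosphorylation", ["Theme", "Site"])` holds, while a `Cause` argument on a `Phosphorylation` event is rejected because only the REG class lists that role.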
Figure 1. Structured representation of a biomedical event.
Figure 2. The framework of the proposed method.
Part of frequent sequential patterns.
| ID | Sequence database | Frequent sequential pattern |
|---|---|---|
Algorithm 1. Sample filter.
Figure 3. The architecture of the C-DSSM.
Statistics on the GE data sets.
| Data set | Papers: Training | Papers: Devel | Papers: Test | Abstracts: Training | Abstracts: Devel | Abstracts: Test | Events: Training | Events: Devel | Events: Test |
|---|---|---|---|---|---|---|---|---|---|
| GE'13 | 10 | 10 | 14 | 0 | 0 | 0 | 2817 | 3199 | 3348 |
| GE'11 | 5 | 5 | 4 | 800 | 150 | 260 | 10310 | 4690 | 5301 |
Training is training data, Devel is development data, and Test is test data.
The ratio of the positive and negative samples on the training data.
| X-Y | 3-3 | 4-1 | 4-2 | 4-3 | 5-1 | 5-2 | 5-3 | 6-1 | Original |
|---|---|---|---|---|---|---|---|---|---|
| Positive samples | 11419 | 11419 | 11419 | 11419 | 11419 | 11419 | 11419 | 11419 | 11419 |
| Negative samples | 60742 | 77829 | 70901 | 66809 | 84910 | 74851 | 69180 | 85247 | 150308 |
| P : N | 1 : 5.319 | 1 : 6.816 | 1 : 6.209 | 1 : 5.851 | 1 : 7.436 | 1 : 6.555 | 1 : 6.058 | 1 : 7.465 | 1 : 13.163 |
X is minimum support (minsup), Y is threshold Θ, and P : N is the ratio of the positive and negative samples.
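The P : N row follows directly from the two sample counts. As a minimal sketch, assuming the ratio is simply negatives divided by positives rounded to three decimals, the function below reproduces the reported values; the names `pn_ratio` and `removed_fraction` are hypothetical helpers, not part of the paper.

```python
def pn_ratio(positive: int, negative: int) -> str:
    """Format the positive-to-negative ratio as '1 : x' with three decimals."""
    return f"1 : {negative / positive:.3f}"


def removed_fraction(original_negative: int, filtered_negative: int) -> float:
    """Share of the original negative samples dropped by the filter."""
    return 1.0 - filtered_negative / original_negative


# Counts copied from the training-data table (setting X-Y -> negative samples).
POSITIVE = 11419
NEGATIVE = {
    "3-3": 60742, "4-1": 77829, "4-2": 70901, "4-3": 66809,
    "5-1": 84910, "5-2": 74851, "5-3": 69180, "6-1": 85247,
    "Original": 150308,
}

for setting, negative in NEGATIVE.items():
    print(setting, pn_ratio(POSITIVE, negative))
```

The same helper applies to any other pair of positive and negative counts, and `removed_fraction(NEGATIVE["Original"], NEGATIVE["3-3"])` gives the share of negative samples that a given setting filters out.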
Figure 4. (a) The F-score of four sequences on the GE'11 development data. (b) Comparison of the F-score of the sequence and original for each event.
The ratio of the positive and negative samples on the final training data.
| X-Y | 3-3 | 4-1 | 4-2 | 4-3 | 5-1 | 5-2 | 5-3 | 6-1 | Original |
|---|---|---|---|---|---|---|---|---|---|
| Positive samples | 16375 | 16375 | 16375 | 16375 | 16375 | 16375 | 16375 | 16375 | 16375 |
| Negative samples | 88054 | 112961 | 103045 | 97192 | 123122 | 108571 | 100341 | 123528 | 215383 |
| P : N | 1 : 5.377 | 1 : 6.898 | 1 : 6.292 | 1 : 5.935 | 1 : 7.519 | 1 : 6.630 | 1 : 6.128 | 1 : 7.544 | 1 : 13.153 |
X is minimum support (minsup), Y is threshold Θ, and P : N is the ratio of the positive and negative samples.
Results with/without triplets of the Binding events.
| Binding events | Without triplets: R | Without triplets: P | Without triplets: F | With triplets: R | With triplets: P | With triplets: F |
|---|---|---|---|---|---|---|
| GE'11 | 42.57 | 49.76 | 45.88 | 47.66 | 56.52 | 51.71 |
| GE'13 | 30.03 | 31.35 | 30.67 | 39.34 | 44.11 | 41.59 |
Performance is shown in recall (R), precision (P), and F-score (F).
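The F column can be checked against R and P. The sketch below assumes F is the balanced F-score (the harmonic mean of recall and precision); the reported values are consistent with that reading to within rounding, but the formula is not stated in this excerpt. The function name `f_score` is illustrative.

```python
def f_score(recall: float, precision: float) -> float:
    """Balanced F-score: harmonic mean of recall and precision (in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision, reported F) for Binding events, copied from the table.
BINDING_ROWS = {
    "GE'11 without triplets": (42.57, 49.76, 45.88),
    "GE'11 with triplets": (47.66, 56.52, 51.71),
    "GE'13 without triplets": (30.03, 31.35, 30.67),
    "GE'13 with triplets": (39.34, 44.11, 41.59),
}

for name, (r, p, reported) in BINDING_ROWS.items():
    # Differences of about 0.01 are expected because R and P are already rounded.
    print(f"{name}: computed {f_score(r, p):.2f}, reported {reported:.2f}")
```

Because the inputs are themselves rounded to two decimals, a recomputed F can differ from the reported one in the last digit.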
Results of the proposed method on GE'11 test set.
| Event class | Event type | Whole R | Whole P | Whole F | Abstract R | Abstract P | Abstract F | Full text R | Full text P | Full text F |
|---|---|---|---|---|---|---|---|---|---|---|
| SVT | | 75.55 | 80.19 | 77.80 | 72.44 | 77.71 | 74.98 | 83.57 | 86.35 | 84.94 |
| | | 52.87 | 68.66 | 59.74 | 54.74 | 70.75 | 61.73 | 45.95 | 60.71 | 52.31 |
| | | 66.67 | 76.92 | 71.43 | 64.29 | 81.82 | 72.00 | 100.00 | 50.00 | 66.67 |
| | | 80.00 | 88.10 | 83.85 | 77.04 | 88.14 | 82.21 | 88.00 | 88.00 | 88.00 |
| | | 35.60 | 86.08 | 50.37 | 32.76 | 95.00 | 48.72 | 64.71 | 57.89 | 61.11 |
| | Total | 68.60 | 80.34 | 74.01 | 64.97 | 79.34 | 71.44 | 79.74 | 82.97 | 81.32 |
| BIND | Binding | 47.66 | 56.52 | 51.71 | 49.57 | 59.72 | 54.17 | 43.06 | 49.21 | 45.93 |
| REG | | 32.47 | 47.53 | 38.58 | 31.96 | 51.38 | 39.41 | 34.04 | 39.02 | 36.36 |
| | | 41.03 | 42.35 | 41.68 | 41.10 | 40.64 | 40.87 | 40.87 | 46.53 | 43.52 |
| | | 38.18 | 46.38 | 41.88 | 40.37 | 48.57 | 44.09 | 33.85 | 41.94 | 37.46 |
| | Total | 38.97 | 43.88 | 41.28 | 39.32 | 43.62 | 41.36 | 38.20 | 44.46 | 41.10 |
| ALL total | | 50.35 | 57.79 | 53.81 | 49.97 | 57.90 | 53.64 | 51.29 | 57.52 | 54.23 |
Performance is shown in recall (R), precision (P), and F-score (F).
Comparison with other systems on GE'11 test set.
| System | Data set | SVT | BIND | REG | ALL |
|---|---|---|---|---|---|
| FAUST | W | 68.47/80.25/73.90 | 44.20/53.71/48.49 | 38.02/ | 49.41/ |
| | A | | 45.53/58.09/51.05 | 39.38/ | 50.00/ |
| | F | 75.58/78.23/76.88 | 40.97/44.70/42.75 | 34.99/ | 47.92/58.47/52.67 |
| UMass | W | 67.01/ | 42.97/56.42/48.79 | 37.52/52.67/43.82 | 48.49/64.08/55.20 |
| | A | 64.21/80.74/71.54 | 43.52/ | 38.78/55.07/45.51 | 48.74/65.94/56.05 |
| | F | 75.58/ | 41.67/47.62/44.44 | 34.72/47.51/40.12 | 34.72/47.51/40.12 |
| UTurku | W | 68.22/76.47/72.11 | 42.97/43.60/43.28 | 38.72/47.64/42.72 | 49.56/57.65/53.30 |
| | A | 64.97/76.72/70.36 | 45.24/50.00/47.50 | | |
| | F | 78.18/75.82/76.98 | 37.50/31.76/34.39 | 34.99/44.46/39.16 | 48.31/53.38/50.72 |
| MSR-NLP | W | | 42.36/40.47/41.39 | 36.64/44.08/40.02 | 48.64/54.71/51.50 |
| | A | 65.99/74.71/70.08 | 43.23/44.51/43.86 | 37.14/45.38/40.85 | 48.52/56.47/52.20 |
| | F | 78.18/73.24/75.63 | 40.28/32.77/36.14 | 35.52/41.34/38.21 | 48.94/50.77/49.84 |
| STSS | W | — | — | — | — |
| | A | 64.97/76.65/70.33 | 45.24/49.84/47.43 | | 50.06/59.33/54.30 |
| | F | 78.18/75.63/76.88 | 37.50/31.58/34.29 | 34.99/44.69/39.25 | 48.31/53.43/50.74 |
| Ours | W | 68.60/80.34/74.01 | 47.66/56.52/51.71 | 38.97/43.88/41.28 | 50.35/57.79/53.81 |
| | A | 64.97/79.34/71.44 | 49.57/59.72/54.17 | 39.32/43.62/41.36 | 49.97/57.90/53.64 |
| | F | 79.74/82.97/81.32 | 43.06/49.21/45.93 | 38.20/44.46/41.10 | 51.29/57.52/54.23 |
Evaluation results (recall/precision/F-score) in whole data set (W), abstracts only (A), and full papers only (F).
Results of the proposed method on GE'13 test set.
| Event class | Event type | R | P | F |
|---|---|---|---|---|
| SVT | | 82.88 | 79.91 | 81.36 |
| | | 52.48 | 65.43 | 58.24 |
| | | 64.29 | 50.00 | 56.25 |
| | | 81.25 | 76.02 | 78.55 |
| | | 31.31 | 83.78 | 45.59 |
| | Total | 74.12 | 77.56 | 75.80 |
| BIND | Binding | 39.34 | 44.11 | 41.59 |
| REG | | 23.61 | 39.08 | 29.44 |
| | | 39.56 | 46.71 | 42.84 |
| | | 39.73 | 46.24 | 42.74 |
| | Total | 37.24 | 45.74 | 41.05 |
| ALL total | | 48.65 | 56.24 | 52.17 |
Performance is shown in recall (R), precision (P), and F-score (F).
Comparison with other systems on GE'13 test set.
| System | SVT | BIND | REG | ALL |
|---|---|---|---|---|
| TEES 2.1 | | 42.34/44.34/43.32 | 33.08/44.78/38.05 | 46.60/56.32/51.00 |
| EVEX | 73.82/77.73/75.72 | 41.14/44.77/42.88 | 32.41/47.16/38.41 | 45.87/58.03/51.24 |
| BioSEM | 70.09/ | | 28.19/ | 42.84/ |
| Ours | 74.12/77.56/75.80 | 39.34/44.11/41.59 | 37.24/45.74/41.05 | 48.65/56.24/52.17 |
Performance is shown in recall (R), precision (P), and F-score (F).