Dat Quoc Nguyen, Karin Verspoor.
Abstract
BACKGROUND: Given the importance of relation or event extraction from biomedical research publications to support knowledge capture and synthesis, and the strong dependency of approaches to this information extraction task on syntactic information, it is valuable to understand which approaches to syntactic processing of biomedical text have the highest performance.
Keywords: Biomedical event extraction; Dependency parsing; Neural networks; POS tagging
Year: 2019 PMID: 30755172 PMCID: PMC6373122 DOI: 10.1186/s12859-019-2604-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Diagram outlining the design of experiments
Parsing results on the test set with predicted POS tags and gold tokenization (except [] which denotes results when employing gold POS tags in both training and testing phases)
| Corpus | Models | System | With punctuation: Overall LAS | With punctuation: Overall UAS | With punctuation: Exact match LAS | With punctuation: Exact match UAS | Without punctuation: Overall LAS | Without punctuation: Overall UAS | Without punctuation: Exact match LAS | Without punctuation: Exact match UAS |
|---|---|---|---|---|---|---|---|---|---|---|
| GENIA | Pre-trained | Stanford-NNdep [ ∙] | 86.66 | 88.22 | 25.15 | 29.26 | 87.31 | 89.02 | 25.88 | 30.22 |
| GENIA | Pre-trained | Stanford-Biaffine-v1 [ ∙] | 84.69 | 87.95 | 16.25 | 26.10 | 84.92 | 88.55 | 16.99 | 28.24 |
| GENIA | Pre-trained | Stanford-NNdep | 86.79 | 88.13 | 25.22 | 29.19 | 87.43 | 88.91 | 25.88 | 30.15 |
| GENIA | Pre-trained | Stanford-Biaffine-v1 | 84.72 | 87.89 | 16.47 | 25.81 | 84.94 | 88.45 | 17.06 | 27.79 |
| GENIA | Pre-trained | BLLIP+Bio | | | | | | | | |
| GENIA | Retrained | Stanford-NNdep | 87.02 | 88.34 | 25.74 | 30.07 | 87.56 | 89.02 | 26.03 | 30.59 |
| GENIA | Retrained | NLP4J-dep | 88.20 | 89.45 | 28.16 | 31.99 | 88.87 | 90.25 | 28.90 | 32.94 |
| GENIA | Retrained | jPTDP-v1 | 90.01 | 91.46 | 29.63 | 35.74 | 90.27 | 91.89 | 30.29 | 37.06 |
| GENIA | Retrained | Stanford-Biaffine-v2 | | 92.31 | | | | 92.64 | 34.41 | |
| GENIA | Retrained | Stanford-Biaffine-v2 [] | 91.68 | 92.51 | 36.99 | 40.44 | 91.92 | 92.84 | 38.01 | 41.84 |
| CRAFT | Retrained | Stanford-NNdep | 84.76 | 86.64 | 25.31 | 30.40 | 85.59 | 87.81 | 25.48 | 30.96 |
| CRAFT | Retrained | NLP4J-dep | 86.98 | 88.85 | 27.60 | 33.71 | 87.62 | 89.80 | 28.16 | 34.60 |
| CRAFT | Retrained | jPTDP-v1 | 88.27 | 90.08 | 29.68 | 36.06 | 88.66 | 90.79 | 30.24 | 37.12 |
| CRAFT | Retrained | Stanford-Biaffine-v2 | | | | | | | | |
| CRAFT | Retrained | Stanford-Biaffine-v2 [] | 91.43 | 92.93 | 35.22 | 41.99 | 91.69 | 93.47 | 35.61 | 42.95 |
“Without punctuation” refers to results excluding punctuation and other symbols from evaluation. “Exact match” denotes the percentage of sentences whose predicted trees are entirely correct [25]. [ ∙] denotes the use of the pre-trained Stanford tagger for predicting POS tags on the test set, instead of using the retrained NLP4J-POS model. Score differences between the “retrained” parsers on both corpora are significant at p ≤ 0.001 using McNemar’s test (except UAS scores obtained by Stanford-Biaffine-v2 for gold and predicted POS tags on GENIA, i.e. 92.51 vs. 92.31 and 92.84 vs. 92.64, where p ≤ 0.05)
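The metrics described in this note can be illustrated with a minimal sketch. It assumes each token is given as a (head index, dependency label, is-punctuation) triple; the function name `attachment_scores` and the toy sentences are hypothetical and are not taken from the evaluation scripts used in the study.

```python
from typing import Dict, List, Tuple

# Assumed token representation: (head index, dependency label, is_punctuation).
Token = Tuple[int, str, bool]
Sentence = List[Token]


def attachment_scores(gold: List[Sentence], pred: List[Sentence],
                      with_punct: bool = True) -> Dict[str, float]:
    """Return overall and exact-match UAS/LAS in percent."""
    total = uas_hits = las_hits = 0
    exact_uas = exact_las = 0
    for gold_sent, pred_sent in zip(gold, pred):
        sent_uas_ok = sent_las_ok = True
        for (g_head, g_label, is_punct), (p_head, p_label, _) in zip(gold_sent, pred_sent):
            if is_punct and not with_punct:
                continue  # "without punctuation": skip these tokens entirely
            total += 1
            head_ok = g_head == p_head
            label_ok = head_ok and g_label == p_label  # LAS needs head and label
            uas_hits += head_ok
            las_hits += label_ok
            sent_uas_ok = sent_uas_ok and head_ok
            sent_las_ok = sent_las_ok and label_ok
        exact_uas += sent_uas_ok  # whole tree correct (unlabeled)
        exact_las += sent_las_ok  # whole tree correct (labeled)
    n_sent = len(gold)
    return {
        "UAS": 100.0 * uas_hits / total,
        "LAS": 100.0 * las_hits / total,
        "UAS_exact": 100.0 * exact_uas / n_sent,
        "LAS_exact": 100.0 * exact_las / n_sent,
    }


# Toy data: one wrong punctuation attachment, one wrong label.
toy_gold = [[(2, "nsubj", False), (0, "root", False), (2, "punct", True)],
            [(2, "det", False), (0, "root", False)]]
toy_pred = [[(2, "nsubj", False), (0, "root", False), (1, "punct", True)],
            [(2, "amod", False), (0, "root", False)]]
print(attachment_scores(toy_gold, toy_pred, with_punct=True))
print(attachment_scores(toy_gold, toy_pred, with_punct=False))
```

In the toy data the only head error is on a punctuation token, so UAS rises when punctuation is excluded, which mirrors the direction of the differences between the two halves of the table.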
Biomedical event extraction results
| Corpus | System | Development R | Development P | Development F1 | Test R | Test P | Test F1 |
|---|---|---|---|---|---|---|---|
| | Stanford&Paris | 49.92 | 55.75 | 52.67 | 45.03 | 56.93 | 50.29 |
| | BLLIP+Bio | 47.90 | 61.54 | 53.87 (52.35) | 41.45 | 60.45 | 49.18 (49.19) |
| GENIA | Stanford-Biaffine-v2 | 50.53 | 56.47 | 53.34 | 43.87 | 56.36 | 49.34 |
| GENIA | jPTDP-v1 | 49.30 | 58.58 | | 42.11 | 54.94 | 47.68 (48.88) |
| GENIA | NLP4J-dep | 51.93 | 55.15 | 53.49 (52.20) | 45.88 | 55.53 | |
| GENIA | Stanford-NNdep | 46.79 | 60.36 | 52.71 (51.38) | 40.16 | 59.75 | 48.04 (48.51) |
| CRAFT | Stanford-Biaffine-v2 | 49.47 | 57.98 | 53.39 | 42.08 | 58.65 | |
| CRAFT | jPTDP-v1 | 49.36 | 58.22 | | 40.82 | 58.57 | 48.11 (49.57) |
| CRAFT | NLP4J-dep | 48.91 | 53.13 | 50.93 (51.03) | 41.95 | 51.88 | 46.39 (47.46) |
| CRAFT | Stanford-NNdep | 46.34 | 56.83 | 51.05 (51.01) | 38.87 | 59.64 | 47.07 (46.38) |

Values in parentheses denote results for which TEES is trained without the dependency labels
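The F1 columns relate to recall (R) and precision (P) through the usual harmonic mean. The sketch below is only illustrative: the helper name `f1_score` is hypothetical, and recomputing F1 from rounded R and P values can differ in the last digit from a score computed on raw counts.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Stanford&Paris row of the table: development and test sets.
print(round(f1_score(49.92, 55.75), 2))  # 52.67
print(round(f1_score(45.03, 56.93), 2))  # 50.29
```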
The number of files (#file), sentences (#sent), word tokens (#token) and out-of-vocabulary (OOV) percentage in each experimental dataset
| Dataset | Split | #file | #sent | #token | OOV (%) |
|---|---|---|---|---|---|
| GENIA | Training | 1,701 | 15,820 | 414,608 | 0.0 |
| GENIA | Development | 148 | 1,361 | 36,180 | 4.4 |
| GENIA | Test | 150 | 1,360 | 35,639 | 4.4 |
| CRAFT | Training | 55 | 18,644 | 481,247 | 0.0 |
| CRAFT | Development | 6 | 1,280 | 31,820 | 6.6 |
| CRAFT | Test | 6 | 1,786 | 47,926 | 6.3 |
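As a rough illustration of the OOV column, the sketch below counts the share of evaluation tokens whose word form never occurs in the training split. This token-level definition is an assumption (the table does not state whether the percentage is computed over tokens or over word types), and the function name `oov_rate` is hypothetical.

```python
from typing import Iterable, List


def oov_rate(train_tokens: Iterable[str], eval_tokens: List[str]) -> float:
    """Percentage of evaluation tokens not seen in the training vocabulary."""
    vocab = set(train_tokens)
    if not eval_tokens:
        return 0.0
    unseen = sum(1 for tok in eval_tokens if tok not in vocab)
    return 100.0 * unseen / len(eval_tokens)


train = ["the", "protein", "binds"]
print(oov_rate(train, ["the", "kinase", "binds", "DNA"]))  # 50.0
print(oov_rate(train, train))                              # 0.0
```

Measured against itself, a training split has an OOV rate of 0.0 by construction, which matches the training rows of the table.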
Statistics by the most frequent dependency and overlapped POS labels, sentence length (i.e. number of words in the sentence) and relative dependency distances i−j from a dependent w_i to its head w_j
| Dependency label (GENIA) | % | Dependency label (CRAFT) | % | POS tag | % (GENIA) | % (CRAFT) | Length | % | Distance | % (GENIA) | % (CRAFT) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| advmod | 2.3 | ADV | 4.0 | CC | 3.6 | 3.2 | GENIA | | <−5 | 4.1 | 3.9 |
| amod | 9.6 | AMOD | 1.9 | CD | 1.6 | 4.0 | 1-10 | 3.5 | −5 | 1.2 | 1.2 |
| appos | 1.2 | CONJ | 3.6 | DT | 7.6 | 6.6 | 11-20 | 31.0 | −4 | 2.1 | 2.1 |
| aux | 1.4 | COORD | 3.2 | IN | 12.9 | 11.3 | 21-30 | 35.7 | −3 | 4.4 | 3.2 |
| auxpass | 1.5 | DEP | 1.0 | JJ | 10.1 | 7.6 | 31-40 | 19.4 | −2 | 10.6 | 8.5 |
| cc | 3.5 | LOC | 1.7 | NN | 29.3 | 24.2 | 41-50 | 7.1 | −1 | 24.1 | 21.7 |
| conj | 3.9 | NMOD | 33.7 | NNS | 6.9 | 6.6 | >50 | 3.3 | 1 | 19.0 | 26.5 |
| dep | 2.1 | OBJ | 2.8 | RB | 2.5 | 2.4 | | | 2 | 9.4 | 9.8 |
| det | 7.2 | P | 18.4 | TO | 1.6 | 0.6 | CRAFT | | 3 | 6.3 | 5.9 |
| dobj | 3.1 | PMOD | 10.6 | VB | 1.1 | 1.1 | 1-10 | 17.8 | 4 | 4.0 | 3.4 |
| mark | 1.1 | PRD | 0.9 | VBD | 2.1 | 2.2 | 11-20 | 23.1 | 5 | 2.4 | 2.3 |
| nn | 11.6 | PRN | 1.9 | VBG | 1.0 | 1.1 | 21-30 | 25.2 | >5 | 12.3 | 11.6 |
| nsubj | 4.1 | ROOT | 3.9 | VBN | 3.1 | 3.8 | 31-40 | 17.5 | - | - | - |
| nsubjpass | 1.4 | SBJ | 4.9 | VBP | 1.4 | 1.1 | 41-50 | 9.3 | - | - | - |
| num | 1.2 | SUB | 0.9 | VBZ | 1.9 | 1.4 | >50 | 7.1 | - | - | - |
| pobj | 12.2 | TMP | 0.9 | - | - | - | - | - | - | - | - |
| prep | 12.3 | VC | 2.4 | - | - | - | - | - | - | - | - |
| punct | 10.4 | - | - | - | - | - | - | - | - | - | - |
| root | 3.8 | - | - | - | - | - | - | - | - | - | - |

The paired % columns under POS tags and Distance give the occurrence proportions in GENIA and CRAFT, respectively. In the Length column, the rows following “GENIA” and “CRAFT” give the sentence-length distribution of the respective corpus
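The distance statistics can be reproduced in outline as follows. This is a sketch under stated assumptions: each sentence is given as a list of 1-based head indices, head index 0 marks the root and is skipped, and the bin labels use an ASCII minus sign. The names `distance_bin` and `distance_proportions` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def distance_bin(dep_index: int, head_index: int) -> str:
    """Bin the relative distance i - j between a dependent and its head."""
    d = dep_index - head_index
    if d < -5:
        return "<-5"
    if d > 5:
        return ">5"
    return str(d)  # one bin per distance in -5..-1 and 1..5


def distance_proportions(heads_per_sentence: List[List[int]]) -> Dict[str, float]:
    """Percentage of dependencies falling into each distance bin."""
    counts: Counter = Counter()
    for heads in heads_per_sentence:
        for i, j in enumerate(heads, start=1):  # i: dependent, j: head
            if j == 0:
                continue  # assumption: the root attachment has no distance
            counts[distance_bin(i, j)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {b: 100.0 * c / total for b, c in counts.items()}


# Toy sentence of three words whose second word is the root.
print(distance_proportions([[2, 0, 2]]))  # {'-1': 50.0, '1': 50.0}
```

Negative distances correspond to dependents that precede their head, and positive distances to dependents that follow it.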
POS tagging accuracies on the test set with gold tokenization
| Model | GENIA | CRAFT |
|---|---|---|
| MarMoT | 98.61 | 97.07 |
| jPTDP-v1 | 98.66 | 97.24 |
| NLP4J-POS | 98.80 | 97.43 |
| BiLSTM-CRF | 98.44 | 97.25 |
| + CNN-char | | 97.51 |
| + LSTM-char | 98.85 | |
| Stanford tagger [ ⋆] | 98.37 | _ |
| GENIA tagger [ ⋆] | 98.49 | _ |
[ ⋆] denotes a result with a pre-trained POS tagger. We do not provide accuracy results of the pre-trained POS taggers on CRAFT because CRAFT uses an extended PTB POS tag set (i.e. there are POS tags in CRAFT that are not defined in the original PTB POS tag set). Corpus-level accuracy differences of at least 0.17% in GENIA and 0.26% in CRAFT between two POS tagging models are significant at p ≤ 0.05. Here, we compute sentence-level accuracies, then use a paired t-test to measure the significance level
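The significance procedure mentioned in this note (sentence-level accuracies compared with a paired t-test) can be sketched with the standard library alone. The helper names below are hypothetical, and the sketch stops at the t statistic: turning it into a p-value requires the Student t distribution with n−1 degrees of freedom, for example through a statistics package such as SciPy (`scipy.stats.ttest_rel`).

```python
import math
import statistics
from typing import List, Sequence


def sentence_accuracy(gold_tags: Sequence[str], pred_tags: Sequence[str]) -> float:
    """Fraction of tokens in one sentence whose POS tag is predicted correctly."""
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return correct / len(gold_tags)


def paired_t_statistic(acc_a: List[float], acc_b: List[float]) -> float:
    """t statistic of the per-sentence accuracy differences between two taggers."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(n))


# Toy per-sentence accuracies for two hypothetical taggers on four sentences.
tagger_a = [1.0, 0.9, 0.8, 1.0]
tagger_b = [0.9, 0.9, 0.7, 0.8]
print(sentence_accuracy(["DT", "NN", "VBZ"], ["DT", "NN", "VBD"]))
print(paired_t_statistic(tagger_a, tagger_b))
```

The statistic is undefined when all per-sentence differences are identical (zero standard deviation), so a real comparison would need to handle that case separately.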
Fig. 2 LAS scores by sentence length. Scores obtained on GENIA and CRAFT are presented in the left and right figures, respectively
Fig. 3 LAS (F1) scores by dependency distance. Scores obtained on GENIA and CRAFT are presented in the left and right figures, respectively
LAS (F1) scores of Stanford-Biaffine on GENIA, by frequent dependency labels in the left dependencies
| Type | Prop. (<−5) | LAS (<−5) | Prop. (−5) | LAS (−5) | Prop. (−4) | LAS (−4) |
|---|---|---|---|---|---|---|
| advmod | 7.2 | | 4.2 | 90.91 | 4.6 | 88.52 |
| amod | 4.8 | 74.19 | 8.1 | 80.00 | 17.5 | |
| det | 4.3 | 85.71 | 17.7 | | 21.3 | 88.97 |
| mark | 15.4 | 98.49 | 11.5 | | 6.4 | 97.62 |
| nn | 4.7 | 74.38 | 15.7 | | 16.6 | 76.71 |
| nsubj | 28.2 | 93.96 | 19.0 | 94.67 | 15.3 | |
| nsubjpass | 15.9 | | 11.3 | 92.13 | 3.9 | 86.27 |
| prep | 11.9 | 96.10 | 6.7 | | 2.6 | 88.24 |
“Prop.” denotes the occurrence proportion in each distance bin
LAS by the basic Stanford dependency labels on GENIA
| Type | Biaffine | jPTDP | NLP4J | NNdep | Avg. |
|---|---|---|---|---|---|
| advmod | | 86.77 | 87.26 | 83.86 | 86.32 |
| amod | | 92.21 | 90.59 | 90.94 | 91.54 |
| appos | | 83.25 | 80.41 | 77.32 | 81.32 |
| aux | 98.74 | | 98.92 | 97.66 | 98.65 |
| auxpass | 99.32 | 99.32 | | 99.32 | 99.36 |
| cc | | 86.38 | 82.21 | 79.33 | 84.46 |
| conj | | 78.64 | 73.32 | 69.40 | 76.30 |
| dep | 40.49 | | 40.04 | 31.66 | 38.48 |
| det | | 96.68 | 95.46 | 95.54 | 96.21 |
| dobj | | 95.87 | 94.90 | 92.18 | 94.86 |
| mark | | 90.38 | 89.62 | 90.89 | 91.39 |
| nn | 90.07 | | 88.22 | 88.97 | 89.38 |
| nsubj | | 94.71 | 93.18 | 90.75 | 93.62 |
| nsubjpass | | | 92.05 | 90.94 | 93.53 |
| num | 89.14 | 85.97 | 90.05 | | 88.86 |
| pobj | | 96.54 | 96.54 | 95.13 | 96.31 |
| prep | | 89.93 | 89.18 | 88.31 | 89.49 |
| root | | 97.13 | 94.78 | 92.87 | 95.52 |
“Avg.” denotes the averaged score of the four dependency parsers
LAS by the CoNLL 2008 dependency labels on CRAFT
| Type | Biaffine | jPTDP | NLP4J | NNdep | Avg. |
|---|---|---|---|---|---|
| ADV | | 77.53 | 75.58 | 71.64 | 75.99 |
| AMOD | | 83.45 | 85.00 | 82.98 | 84.47 |
| CONJ | | 88.69 | 85.42 | 83.34 | 87.30 |
| COORD | | 84.75 | 79.42 | 76.38 | 82.26 |
| DEP | | 67.96 | 62.83 | 52.43 | 64.11 |
| LOC | | 68.91 | 68.64 | 61.35 | 67.40 |
| NMOD | | 91.19 | 90.77 | 90.04 | 91.14 |
| OBJ | | 94.53 | 93.85 | 91.34 | 94.06 |
| PMOD | | 94.85 | 94.52 | 93.44 | 94.78 |
| PRD | | 90.11 | 92.49 | 90.66 | 91.81 |
| PRN | | 61.30 | 49.26 | 46.96 | 54.91 |
| ROOT | | 97.20 | 95.24 | 91.27 | 95.47 |
| SBJ | | 93.03 | 91.82 | 90.11 | 92.71 |
| SUB | | 91.81 | 91.81 | 89.64 | 92.11 |
| TMP | | 68.81 | 65.71 | 59.73 | 68.25 |
| VC | | 97.50 | 98.09 | 96.09 | 97.63 |
LAS by POS tag of the dependent
| Type | GENIA: Biaffine | GENIA: jPTDP | GENIA: NLP4J | GENIA: NNdep | CRAFT: Biaffine | CRAFT: jPTDP | CRAFT: NLP4J | CRAFT: NNdep |
|---|---|---|---|---|---|---|---|---|
| CC | | 86.70 | 82.75 | 80.20 | | 85.45 | 79.99 | 77.45 |
| CD | | 79.30 | 79.78 | 79.30 | | 85.17 | 84.22 | 79.77 |
| DT | | 95.09 | 93.99 | 93.08 | | 97.39 | 97.18 | 96.77 |
| IN | | 89.50 | 88.41 | 87.58 | | 79.32 | 78.43 | 75.97 |
| JJ | | 89.35 | 88.30 | 87.76 | | 92.91 | 92.50 | 91.70 |
| NN | | 89.92 | 88.26 | 87.62 | | 89.28 | 88.32 | 87.48 |
| NNS | | 92.32 | 91.33 | 87.91 | | 92.57 | 90.91 | 88.30 |
| RB | | 86.92 | 87.73 | 84.61 | | 81.98 | 82.13 | 76.99 |
| TO | 90.97 | 91.50 | | 88.14 | 90.16 | 85.83 | | 83.86 |
| VB | | 87.84 | 85.09 | 83.49 | | | 98.67 | 96.38 |
| VBD | | 93.85 | 90.97 | 90.34 | | 93.21 | 90.03 | 86.86 |
| VBG | | 79.47 | 79.20 | 72.27 | | 81.33 | 81.15 | 75.57 |
| VBN | | 90.53 | 88.02 | 85.51 | | 91.24 | 90.25 | 88.04 |
| VBP | | 93.88 | 92.54 | 90.63 | | 91.18 | 88.98 | 84.09 |
| VBZ | | 94.83 | 93.57 | 92.48 | | 88.77 | 87.67 | 84.25 |
Error examples
| ID | Form | Gold POS | Gold H. | Gold DEP | Prediction POS | Prediction H. | Prediction DEP |
|---|---|---|---|---|---|---|---|
| 19 | both | CC | 24 | preconj | CC | | preconj |
| 20 | the | DT | 24 | det | DT | | |
| 21 | POU(S) | JJ | 24 | amod | | | |
| 22 | and | CC | 21 | cc | CC | 21 | cc |
| 23 | POU(H) | NN | 21 | conj | NN | 21 | conj |
| 24 | domains | NNS | 18 | pobj | NNS | | |
| 23 | the | DT | 26 | det | DT | | det |
| 24 | Oct-1-responsive | JJ | 26 | amod | JJ | | amod |
| 25 | octamer | NN | 26 | nn | NN | 27 | nn |
| 26 | sequence | NN | 22 | pobj | NN | | |
| 27 | ATGCAAAT | NN | 26 | dep | NN | | |
“H.” denotes the head index of the current word
UAS and LAS (F1) scores of re-trained models on the pre-segmented BioNLP-2009 development sentences which contain event interactions
| Metric | Biaffine | jPTDP | NLP4J | NNdep |
|---|---|---|---|---|
| UAS | 95.51 | 93.14 | 92.50 | 91.02 |
| LAS | 94.82 | 92.18 | 91.96 | 90.30 |
Scores are computed on all tokens using the evaluation script from the CoNLL 2017 shared task [31]