Rui Xing, Jie Luo, Tengwei Song.
Abstract
BACKGROUND: Although the biomedical literature is growing rapidly, structured knowledge that computer programs can easily process is still lacking. Extracting such knowledge from plain text and transforming it into structured form makes relation extraction an important problem. Datasets play a critical role in the development of relation extraction methods. However, existing relation extraction datasets in the biomedical domain are mainly human-annotated, and their scale is usually limited because annotation is labor-intensive and time-consuming.
Keywords: Distant supervision; Information extraction; Medline; Relation extraction
Year: 2020 PMID: 33323106 PMCID: PMC7739482 DOI: 10.1186/s12859-020-03889-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Some examples in the BioRel dataset
| Relation | Sentence |
|---|---|
| Anatomic structure has location | A brain mass and a |
| Therapeutic class of | The histamine induced facilitation was blocked completely by cimetidine and |
| Has physical part of anatomic structure | In |
| May treat | Treatment with oral |
| May be treated by | The ventricular effective refractory period, as well as the |
Parameter settings
| Settings | CNN | PCNN | GRU | MIML | MultiR | Mintz |
|---|---|---|---|---|---|---|
| Batch size | 256 | 128 | 128 | – | – | – |
| Epoch | 20 | 20 | 20 | 10 | 15 | 15 |
| Learning rate | 0.4 | 0.2 | 0.3 | – | – | – |
| Word dim | 200 | 200 | 200 | – | – | – |
| Position dim | 10 | 10 | 10 | – | – | – |
| Sentence dim | 230 | 230 | 230 | – | – | – |
| Window size | 3 | 5 | – | – | – | – |
| Dropout | 0.5 | 0.5 | 0.3 | – | – | – |
Fig. 1 Precision/recall curves of CNN, PCNN and GRU-based models
Fig. 2 Precision/recall curves of all baselines
P@N for distantly supervised relation extraction models on BioRel
| Model | P@4000 (%) | P@8000 (%) | P@12000 (%) | P@16000 (%) | Mean (%) | F1 | AUC |
|---|---|---|---|---|---|---|---|
| CNN+ONE | 93.38 | 84.91 | 75.00 | 65.69 | 79.75 | 0.66 | 0.70 |
| CNN+AVE | 94.00 | 90.95 | 81.58 | 71.97 | 85.30 | 0.72 | 0.79 |
| CNN+ATT | 96.40 | 90.59 | 82.35 | 72.31 | 85.41 | 0.72 | 0.78 |
| PCNN+ONE | 92.15 | 83.80 | 74.53 | 65.46 | 78.98 | 0.65 | 0.69 |
| PCNN+AVE | 96.57 | 93.60 | 85.74 | 75.39 | 88.12 | 0.76 | 0.82 |
| PCNN+ATT | 96.15 | 91.11 | 83.27 | 73.40 | 85.98 | 0.73 | 0.79 |
| RNN+ONE | 88.89 | 81.05 | 71.58 | 63.13 | 76.16 | 0.63 | 0.66 |
| RNN+AVE | 96.67 | 92.83 | 83.97 | 73.53 | 87.00 | 0.74 | 0.80 |
| RNN+ATT | 94.60 | 89.63 | 81.81 | 72.54 | 84.65 | 0.72 | 0.78 |
| Mintz | 79.79 | 67.08 | 56.93 | 49.23 | 63.25 | 0.49 | 0.45 |
| MultiR | 72.70 | 66.93 | 40.32 | 20.21 | 50.04 | 0.30 | 0.23 |
| MIML | 73.35 | 59.01 | 48.63 | 31.40 | 53.09 | 0.43 | 0.39 |
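
The P@N columns above are read here as precision among the N highest-confidence predictions. The following is a minimal sketch under that assumption; the function name `precision_at_n` and the toy scores are illustrative and do not come from the paper or its evaluation code.

```python
from typing import List, Tuple


def precision_at_n(predictions: List[Tuple[float, bool]], n: int) -> float:
    """Fraction of correct predictions among the n highest-scoring ones.

    Each prediction is a (confidence score, is_correct) pair.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Rank predictions from most to least confident.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for _, correct in top if correct) / len(top)


# Invented toy predictions, for illustration only.
toy = [(0.95, True), (0.90, True), (0.80, False), (0.60, True), (0.30, False)]
p_at_2 = precision_at_n(toy, 2)
p_at_4 = precision_at_n(toy, 4)
```

With these toy scores, precision falls as N grows because lower-confidence predictions enter the ranked list, which is the same trend visible from P@4000 to P@16000 in the table.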
Statistics of relation extraction datasets
| Dataset | Word | Sentence | Entity | Relation |
|---|---|---|---|---|
| SemEval-2010 | 205k | 10,717 | 21,434 | 9 |
| ACE 2003-2004 | 297k | 12,783 | 46,108 | 24 |
| NYT | 21,457k | 695,059 | 17,816 | 54 |
| BC5CDR | 282k | 11,089 | 29,271 | 1 |
| BB3 | 34k | 1,394 | 2,903 | 1 |
| SeeDev | 43k | 1,549 | 7,082 | 22 |
| GE4 | 134k | 5,130 | 13,012 | 5 |
| i2b2 2010 | 91k | 6,310 | 8,296 | 11 |
| BioRel | 26,166k | 533,560 | 69,513 | 125 |
Features for traditional statistical baselines
| Feature type | Feature |
|---|---|
| Lexical | The sequence of words between the two entities |
| | The part-of-speech of words between the two entities |
| | A flag indicating which entity came first in the sentence |
| | A window of k words to the left of the first entity and their part-of-speech tags |
| | A window of k words to the right of the second entity and their part-of-speech tags |
| Syntactic | A dependency path between the two entities |
| | Part-of-speech of words in dependency path |
| | A ‘window’ node that is not part of the dependency path |
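
The lexical features listed above can be sketched for a single pre-tokenized, pre-tagged sentence as follows. This is a simplified illustration, not the baselines' actual feature extractor: the function name `lexical_features`, the span convention (token indices, end exclusive), and the example sentence are all assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def lexical_features(
    tokens: List[str],
    pos_tags: List[str],
    e1: Span,
    e2: Span,
    k: int = 2,
) -> Dict[str, object]:
    """Collect the five lexical features for one entity pair."""
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")
    # Flag: does entity 1 precede entity 2 in the sentence?
    e1_first = e1[0] < e2[0]
    first, second = (e1, e2) if e1_first else (e2, e1)
    between = slice(first[1], second[0])
    left = slice(max(0, first[0] - k), first[0])
    right = slice(second[1], second[1] + k)
    return {
        "between_words": tokens[between],
        "between_pos": pos_tags[between],
        "e1_first": e1_first,
        "left_window": list(zip(tokens[left], pos_tags[left])),
        "right_window": list(zip(tokens[right], pos_tags[right])),
    }


# Invented example sentence with hand-assigned POS tags.
tokens = ["Oral", "aspirin", "may", "treat", "mild", "headache", "in", "adults"]
pos = ["JJ", "NN", "MD", "VB", "JJ", "NN", "IN", "NNS"]
feats = lexical_features(tokens, pos, e1=(1, 2), e2=(5, 6), k=2)
```

The syntactic features are omitted because they require a dependency parser, which this sketch deliberately does not assume.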
Fig. 3 Distant supervision process for BioRel dataset creation
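
The general distant supervision heuristic can be sketched as follows: any sentence that mentions both entities of a known knowledge-base pair is labeled with that pair's relation. This illustrates the generic idea only and is not the BioRel construction pipeline; the toy knowledge base, the function name `label_sentences`, and the naive substring matching are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (head entity, tail entity)


def label_sentences(
    sentences: List[str], kb: Dict[Pair, str]
) -> List[Tuple[str, str, str, str]]:
    """Label each sentence that mentions both entities of a KB pair."""
    labeled = []
    for sent in sentences:
        text = sent.lower()
        for (head, tail), relation in kb.items():
            # Naive substring match; a real system would use entity linking.
            if head in text and tail in text:
                labeled.append((sent, head, tail, relation))
    return labeled


# Hypothetical one-entry knowledge base and invented sentences.
kb = {("aspirin", "headache"): "may_treat"}
sentences = [
    "Aspirin relieved the headache within an hour.",
    "Aspirin was stored at room temperature.",
    "Headache was reported after aspirin overdose.",
]
labeled = label_sentences(sentences, kb)
```

The third sentence receives the `may_treat` label even though it does not express that relation, which shows the label noise that distant supervision introduces.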