Jing Chen, Baotian Hu, Weihua Peng, Qingcai Chen, Buzhou Tang.
Abstract
BACKGROUND: In biomedical research, extracting chemical-disease relations from unstructured biomedical literature is an essential task. Effective context understanding and knowledge integration are the two main research problems in this task. Most relation extraction work focuses on classifying pairs of entity mentions. Given the effectiveness of machine reading comprehension (RC) for context understanding, solving biomedical relation extraction with an RC framework at both the intra-sentential and inter-sentential levels is a new topic worth exploring. In addition to unstructured biomedical text, many structured knowledge bases (KBs) provide valuable guidance for biomedical relation extraction, so how to use this knowledge within the RC framework is also worth investigating. We propose a knowledge-enhanced reading comprehension (KRC) framework that leverages reading comprehension and prior knowledge for biomedical relation extraction. First, we generate questions for each relation, which reformulates the relation extraction task as a question answering task. Second, based on the RC framework, we integrate knowledge representations through an efficient knowledge-enhanced attention interaction mechanism to guide biomedical relation extraction.
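The reformulation of relation extraction as question answering can be illustrated with a minimal sketch. The template wording, the `RELATION_TEMPLATES` table, the `build_rc_instance` helper, and the example sentence are all hypothetical and not taken from the paper; they only show how a chemical, a relation type, and a passage might be turned into an RC instance whose answer is the related disease mention.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical question templates, one per relation type (not from the paper).
RELATION_TEMPLATES = {
    "induced": "What disease is induced by {chemical}?",
}


@dataclass
class RCInstance:
    question: str
    context: str
    answer: Optional[str]        # gold answer span, or None if no relation holds
    answer_start: Optional[int]  # character offset of the answer in the context


def build_rc_instance(chemical, relation, context, gold_disease=None):
    """Turn a (chemical, relation) pair plus a passage into a QA-style instance."""
    question = RELATION_TEMPLATES[relation].format(chemical=chemical)
    # Locate the gold disease mention; -1 means it is absent or not provided.
    start = context.find(gold_disease) if gold_disease else -1
    if start < 0:
        return RCInstance(question, context, None, None)
    return RCInstance(question, context, gold_disease, start)


# Invented example sentence, used only to exercise the helper.
example = build_rc_instance(
    chemical="haloperidol",
    relation="induced",
    context="Rats became cataleptic after haloperidol administration.",
    gold_disease="cataleptic",
)
print(example.question, example.answer, example.answer_start)
```

An instance with `answer=None` would stand for a negative pair, where the passage supports no relation for the queried chemical.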
Keywords: Biomedical relation extraction; Knowledge attention mechanism; Reading comprehension
Year: 2022 PMID: 34991458 PMCID: PMC8734165 DOI: 10.1186/s12859-021-04534-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The sample document. Chemical and disease mentions are marked in blue and red, respectively. CID means the chemical-induced disease relation
Fig. 2 The overview of the knowledge-enhanced RC model
The overall statistics of the CDR and CHR datasets
| Dataset | Split | Documents | Chemical IDs | Disease IDs | pos | neg |
|---|---|---|---|---|---|---|
| CDR | Training | 500 | 1479 | 1961 | 1038 | 4479 |
| CDR | Development | 500 | 1519 | 1851 | 1012 | 4310 |
| CDR | Test | 500 | 1455 | 2007 | 1066 | 4471 |
| CHR | Training | 7298 | 28158 | – | 19643 | 69843 |
| CHR | Development | 1182 | 4575 | – | 3185 | 11466 |
| CHR | Test | 3614 | 13800 | – | 9578 | 33339 |
Doc-level performance comparison over our proposed model without and with knowledge on the CDR dataset
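| KBs | Method | Model | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|---|
| Without KBs | Traditional ML | ME [ | 62.00 | 55.10 | 58.30 |
| Without KBs | Traditional ML | Kernel-based SVM [ | 53.20 | 69.70 | 60.30 |
| Without KBs | NN-based ML, relation classification | CNN+SDP [ | 58.02 | 76.20 | 65.88 |
| Without KBs | NN-based ML, relation classification | LSTM+CNN [ | 56.20 | 68.00 | 61.50 |
| Without KBs | NN-based ML, relation classification | BRAN(Transformer) [ | 55.60 | 70.80 | 62.10 |
| Without KBs | NN-based ML, relation classification | CNN+CNNchar [ | 57.00 | 68.60 | 62.30 |
| Without KBs | NN-based ML, relation classification | GCNN [ | 52.80 | 66.00 | 58.60 |
| Without KBs | NN-based ML, sequence labeling | Bio-Seq(LSTM+CRF) [ | 60.00 | 67.50 | 63.50 |
| Without KBs | NN-based ML, reading comprehension | RC (Ours) | 65.83 | 66.32 | 66.07 |
| With KBs | Traditional ML | SVM+Rules(+CTD)[ | 68.15 | 66.04 | 67.08 |
| With KBs | Traditional ML | SVM(+CTD+SIDER+MEDI+MeSH) [ | 65.80 | 68.57 | 67.16 |
| With KBs | Traditional ML | Kernel-based models(+CTD) [ | 60.84 | 76.36 | 67.72 |
| With KBs | Traditional ML | SVM(+Euretos KB) [ | 73.10 | 67.60 | 70.20 |
| With KBs | NN-based ML, relation classification | CAN(+CTD) [ | 60.51 | 80.48 | 69.08 |
| With KBs | NN-based ML, relation classification | LSTM+CNN(+CTD) [ | 63.60 | 76.80 | 69.60 |
| With KBs | NN-based ML, relation classification | RPCNN(+CTD+SIDER+MEDI+MeSH fea) [ | 65.24 | 77.21 | 70.77 |
| With KBs | NN-based ML, relation classification | KCN(+CTD) [ | 69.65 | 72.98 | |
| With KBs | NN-based ML, reading comprehension | KRC(+DCh-Miner) (Ours) | 65.33 | 67.17 | 66.23 |
| With KBs | NN-based ML, reading comprehension | KRC(+CTD) (Ours) | 71.93 | 70.45 | |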
‘fea’ denotes features
Doc-level performance comparison over our proposed model without knowledge on the CHR dataset
| Method | Model | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|
| Relation classification | CNN-RE [ | 81.2 | 87.3 | 84.1 |
| Relation classification | RNN-RE [ | 83.0 | 90.1 | 86.4 |
| Relation classification | GCNN [ | 84.7 | 90.5 | 87.5 |
| Reading comprehension | RC (Ours) | 93.5 | 93.0 | |
Results for LMs fine-tuned on an open-domain reading comprehension dataset, without KBs, on the CDR dataset
| Model | Intra-sentential P (%) | Intra-sentential R (%) | Intra-sentential F1 (%) | Inter-sentential P (%) | Inter-sentential R (%) | Inter-sentential F1 (%) | Document P (%) | Document R (%) | Document F1 (%) |
|---|---|---|---|---|---|---|---|---|---|
| BioBERT | 62.27 | | 59.59 | 49.80 | | 18.61 | 59.85 | | 64.20 |
| BioBERT+SQuAD | | 54.50 | | | 11.16 | | | 66.32 | |
Results over pseudo queries and natural queries without KBs on the CDR dataset. ‘Natural Query’ means natural language queries
| Model | Intra-sentential P (%) | Intra-sentential R (%) | Intra-sentential F1 (%) | Inter-sentential P (%) | Inter-sentential R (%) | Inter-sentential F1 (%) | Document P (%) | Document R (%) | Document F1 (%) |
|---|---|---|---|---|---|---|---|---|---|
| Pseudo Query | 66.75 | 52.91 | 59.03 | 57.66 | | | 64.90 | 65.57 | 65.24 |
| Natural Query | | | | | 11.16 | 18.83 | | | |
Ablation study over our proposed KRC model on the CDR dataset
| Model | Intra-sentential P (%) | Intra-sentential R (%) | Intra-sentential F1 (%) | Inter-sentential P (%) | Inter-sentential R (%) | Inter-sentential F1 (%) | Document P (%) | Document R (%) | Document F1 (%) |
|---|---|---|---|---|---|---|---|---|---|
| Only KB | 58.61 | | 60.39 | 39.56 | | | 52.86 | | 64.16 |
| RC | 67.09 | 54.50 | 60.14 | 60.10 | 11.16 | 18.83 | 65.83 | 66.32 | 66.07 |
| RC+KB(atten2) | 67.94 | 56.85 | 61.90 | 62.00 | 11.63 | 19.59 | 66.88 | 69.14 | 67.99 |
| RC+KB(atten1+atten2) | | 58.07 | | | 11.73 | 20.02 | | 70.45 | |
For the metrics at the intra-sentential and inter-sentential levels, we used the calculation methods in [1]. Using the calculation methods in [24] instead, the F1 scores of our model without KBs (RC in the table) are 73.82% and 57.07% at the intra-sentential and inter-sentential levels, respectively
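As a reading aid for the P, R and F1 columns, the sketch below computes F1 as the standard harmonic mean of precision and recall. It does not reproduce the level-specific counting rules of [1] or [24], which are not described here; the function name `f1_score` is illustrative. Applied to the RC row of the ablation table, it recovers the reported document-level and intra-sentential F1 values (66.07 and 60.14).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# RC row of the ablation table: document-level and intra-sentential P and R.
doc_f1 = round(f1_score(65.83, 66.32), 2)
intra_f1 = round(f1_score(67.09, 54.50), 2)
print(doc_f1, intra_f1)
```

Because the published P and R values are themselves rounded, a recomputed F1 can differ from a reported one in the last digit.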
Doc-level performance on the CDR dataset with different scales of CTD knowledge
| KB(CTD) | 0.25 | 0.50 | 0.75 | 1.0 |
|---|---|---|---|---|
| Document-level F1 (%) | 63.68 | 64.81 | 67.09 | 71.18 |
Fig. 3 The proportion of relation combination types extracted from CTD in correctly predicted cases at the intra-sentential and inter-sentential levels
Good cases and bad cases on our RC model with and without KBs
Chemical and disease mentions are marked in blue and red, respectively. Incorrectly predicted answers are marked in teal. The gold answers of the instances in lines 1, 2 and 3 are ‘syndrome of inappropriate secretion of antidiuretic hormone’, ‘cataleptic’ and ‘Posterior reversible encephalopathy syndrome’, respectively. The gold relation of the instances in lines 1, 2 and 3 is ‘induced’
Fig. 4 The error distribution. FNs denotes the false negative examples. FPs denotes the false positive examples. FNs(MI) denotes the instances in FNs that were missing at prediction. FNs(EPI) denotes the incorrectly predicted instances in FNs