Nagesh C Panyam1, Karin Verspoor2, Trevor Cohn1, Kotagiri Ramamohanarao1.
Abstract
BACKGROUND: Relation extraction from biomedical publications is an important task in the semantic mining of text. Kernel methods for supervised relation extraction are often preferred over manual feature engineering when classifying highly ordered structures, such as the trees and graphs obtained from syntactic parsing of a sentence. Tree kernels such as the Subset Tree Kernel and the Partial Tree Kernel have been shown to be effective for classifying constituency parse trees and basic dependency parse graphs of a sentence. Graph kernels such as the All Path Graph (APG) kernel and the Approximate Subgraph Matching (ASM) kernel have been shown to be suitable for classifying general graphs with cycles, such as the enhanced dependency parse graph of a sentence. In this work, we present a high-performance Chemical-Induced Disease (CID) relation extraction system. We present a comparative study of kernel methods for the CID task and extend the study to Protein-Protein Interaction (PPI) extraction, another important biomedical relation extraction task. We discuss novel modifications to the ASM kernel that boost its performance, and a method for applying graph kernels to relations expressed across multiple sentences.
Keywords: APG kernel; ASM kernel; Graph kernels; Relation extraction
Year: 2018 PMID: 29382397 PMCID: PMC5791373 DOI: 10.1186/s13326-017-0168-3
Source DB: PubMed Journal: J Biomed Semantics
Illustration of an annotated PubMed abstract from the CDR corpus
| Title | |
|---|---|
| Abstract | |
| Entity | D011441, Chemical, “ |
| Entity | D011441, Chemical, “ |
| Entity | D056486, Disease, “ |
| Entity | D056486, Disease, “ |
| Entity | D006521, Disease, “ |
| Relation (CID) | D011441 - D006521 |
| Relation (CID) | D011441 - D056486 |
Fig. 1 Illustration of different parse structures for the sentence: “Seizures were caused by Alcohol and Fatigue”
Performance measurements for chemical induced disease relation extraction
| Method | Sent-Rel. P | Sent-Rel. R | Sent-Rel. F | Non-Sent-Rel. P | Non-Sent-Rel. R | Non-Sent-Rel. F | All P | All R | All F |
|---|---|---|---|---|---|---|---|---|---|
| SSTK with CP-Tree | 43.1 | 73.7 | 54.4 | 36.9 | 14.2 | 20.5 | 42.5 | 56.0 | 48.3 |
| PTK with LCT | 42.2 | 75.3 | 54.1 | 30.5 | 40.1 | 34.6 | 39.5 | 64.8 | 49.0 |
| APG with Dep. Graphs | | 80.6 | | | 43.8 | | 53.2 | | 60.3 |
| ASM with Dep. Graphs | 51.6 | | 63.0 | 38.8 | 36.0 | 37.3 | 49.0 | 67.4 | 56.8 |
| Hybrid (Prior art [ | - | - | - | - | - | - | | 49.2 | 56.0 |
| Hybrid + Rules (Prior art [ | - | - | - | - | - | - | 55.6 | 68.4 | |
(Key: P, R and F denote Precision, Recall and F1 score respectively. Sent-Rel. and Non-Sent-Rel. denote sentence-level and non-sentence-level relations respectively. CP-Tree and LCT denote constituency parse tree and location centered tree. Dep. Graph denotes dependency graph. The best performance is highlighted in italicized font.)
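The F column in the table above is the harmonic mean of the P and R columns. The following is a minimal Python sketch, assuming the standard F1 definition; the function name `f1_score` is illustrative and does not come from the paper. It is checked against the SSTK and PTK sentence-level rows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale, e.g. percent)."""
    if precision + recall == 0:
        return 0.0  # convention for the degenerate case
    return 2 * precision * recall / (precision + recall)

# Sentence-level P and R values copied from the table above.
sstk_f = round(f1_score(43.1, 73.7), 1)  # table reports 54.4
ptk_f = round(f1_score(42.2, 75.3), 1)   # table reports 54.1
```

Because the harmonic mean is dominated by the smaller of the two inputs, a method with high recall but low precision still receives a modest F score.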
Performance measurements for protein-protein interaction extraction
| Method | AIMed P | AIMed R | AIMed F | AIMed A | BioInfer P | BioInfer R | BioInfer F | BioInfer A | HPRD50 P | HPRD50 R | HPRD50 F | HPRD50 A |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SOA | | 77.5 | | | 58.1 | | 39.1 | 69.6 | 64.2 | | | |
| APG | 28.6 | | 42.3 | 76.8 | | 28.6 | | 69.7 | 62.3 | 69.9 | 65.9 | 79.7 |
| ASM | 26.3 | 78.0 | 39.3 | 72.9 | 67.2 | 22.6 | 33.8 | | | 58.3 | 61.9 | 76.2 |

| Method | IEPA P | IEPA R | IEPA F | IEPA A | LLL P | LLL R | LLL F | LLL A |
|---|---|---|---|---|---|---|---|---|
| SOA | 78.5 | | | | | | | |
| APG | 78.2 | 41.8 | 54.5 | 80.2 | 84.7 | 57.3 | 68.3 | 83.4 |
| ASM | | 17.3 | 28.6 | 77.7 | 79.3 | 28.0 | 41.4 | 75.3 |
(Key: P, R, F and A denote Precision, Recall, F score and area under curve respectively. SOA denotes state-of-the-art performance with APG as reported in [26]. The best performance is highlighted in italicized font.)
Statistical significance (McNemar’s) tests for the ASM and APG classifiers. The null hypothesis is that the two classifiers are equally accurate, and the significance threshold is 0.01
| Dataset | Training examples | Testing examples | APG accuracy | ASM accuracy | P-value |
|---|---|---|---|---|---|
| AIMed | 11,246 | 5,834 | 58.6 | 53.1 | |
| BioInfer | 7,414 | 9,666 | 77.8 | 76.8 | |
| HPRD50 | 16,647 | 433 | 70.9 | 68.1 | 0.999 |
| IEPA | 16,263 | 817 | 73.6 | 66.5 | |
| LLL | 16,750 | 330 | 75.4 | 65.1 | |
| CID: Sentence level relations | 9,913 | 5,099 | 72.2 | 71.2 | 0.0969 |
| CID: Non Sentence level relations | 21,656 | 11,562 | 84.9 | 84.1 | |
P-values less than the threshold are shown in italicized font
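McNemar's test compares two classifiers on the same test set using only the examples on which they disagree. The sketch below is a minimal Python illustration, assuming the chi-square form with continuity correction; the source does not state which variant was used, and the function name and the disagreement counts in the usage lines are hypothetical.

```python
import math

def mcnemar_p_value(b: int, c: int) -> float:
    """Two-sided McNemar p-value (chi-square approximation, 1 degree of freedom).

    b: examples the first classifier gets right and the second gets wrong
    c: examples the second classifier gets right and the first gets wrong
    """
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree, so there is no evidence of a difference
    # Continuity-corrected statistic, clamped at zero (a choice made in this sketch).
    chi2 = max(abs(b - c) - 1, 0) ** 2 / n
    # Survival function of a chi-square variable with 1 d.o.f.: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical disagreement counts, not taken from the paper.
p = mcnemar_p_value(40, 10)
significant = p < 0.01  # same threshold as in the table above
```

When the number of disagreements is small, an exact binomial version of the test is usually preferred over this approximation.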