Gurusamy Murugesan, Sabenabanu Abdulkadhar, Jeyakumar Natarajan.
Abstract
Automatic extraction of protein-protein interaction (PPI) pairs from biomedical literature is a widely examined task in biological information extraction. Currently, many kernel-based approaches, such as linear kernels, tree kernels, graph kernels, and combinations of multiple kernels, have achieved promising results on the PPI task. However, most of these kernel methods fail to capture the semantic relation information between two entities. In this paper, we present a special type of tree kernel for PPI extraction, known as the Distributed Smoothed Tree Kernel (DSTK), which exploits both syntactic (structural) and semantic vector information. DSTK comprises distributed trees carrying syntactic information along with distributional semantic vectors representing the semantic information of the sentences or phrases. To generate a robust machine learning model, a feature-based kernel and DSTK were combined using an ensemble support vector machine (SVM). Five different corpora (AIMed, BioInfer, HPRD50, IEPA, and LLL) were used to evaluate the performance of our system. Experimental results show that our system achieves a better F-score on the five corpora than other state-of-the-art systems.
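To make the idea of a composite kernel concrete, the following is a minimal sketch of combining a feature-based kernel with a structural kernel. It assumes a weighted sum, which is one common way to combine kernels; the abstract does not state the exact combination rule. The `Example` class, the `alpha` weight, and `overlap_kernel` (a simple shared-fragment count standing in for a tree kernel, not the DSTK itself) are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Example:
    """Hypothetical container for one candidate protein pair."""
    features: Dict[str, float]   # sparse word/distance features
    fragments: FrozenSet[str]    # stand-in for parse-tree fragments


Kernel = Callable[[Example, Example], float]


def feature_kernel(a: Example, b: Example) -> float:
    """Linear kernel: dot product of two sparse feature vectors."""
    return float(sum(v * b.features.get(k, 0.0) for k, v in a.features.items()))


def overlap_kernel(a: Example, b: Example) -> float:
    """Placeholder structural kernel: number of shared fragments.

    This is NOT the DSTK; it only plays the role of a tree kernel here.
    """
    return float(len(a.fragments & b.fragments))


def composite_kernel(k1: Kernel, k2: Kernel, alpha: float = 0.5) -> Kernel:
    """Weighted sum of two kernels (assumed combination rule)."""
    def combined(a: Example, b: Example) -> float:
        return alpha * k1(a, b) + (1.0 - alpha) * k2(a, b)
    return combined


def gram_matrix(kernel: Kernel, examples: List[Example]) -> List[List[float]]:
    """Pairwise kernel values, the form a kernel SVM consumes."""
    return [[kernel(a, b) for b in examples] for a in examples]


# Toy inputs, invented for illustration only.
x1 = Example({"interact": 1.0, "n_proteins": 2.0},
             frozenset({"(VP interact)", "(NP PROT)"}))
x2 = Example({"interact": 1.0, "n_proteins": 3.0},
             frozenset({"(NP PROT)"}))

k_composite = composite_kernel(feature_kernel, overlap_kernel, alpha=0.5)
gram = gram_matrix(k_composite, [x1, x2])
print(gram)
```

A non-negative weighted sum of two valid kernels is itself a valid kernel, so the resulting Gram matrix could be passed to any SVM implementation that accepts precomputed kernels.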
Year: 2017 PMID: 29099838 PMCID: PMC5669485 DOI: 10.1371/journal.pone.0187379
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. System overview.
Word and distance feature vector.
| Feature Names | Feature Values |
|---|---|
| P- | |
| b-and | |
| Words surrounding protein names | |
| ik-interact | |
| No. of non-proteins | |
| No. of proteins | |
Fig 2. Workflow for feature extraction in both the feature-based kernel and DSTK.
Fig 3. Distributed Smoothed Tree (DST): A) Lexicalized parse tree for DST; B) Subtrees of the lexicalized parse tree.
List of corpora used for evaluation.
| S.No | Corpus | Sentences | Positive interaction pairs | Negative interaction pairs |
|---|---|---|---|---|
| 1 | AIMed | 1955 | 1000 | 4834 |
| 2 | BioInfer | 1100 | 2534 | 7132 |
| 3 | HPRD50 | 145 | 163 | 270 |
| 4 | IEPA | 486 | 335 | 482 |
| 5 | LLL | 77 | 164 | 166 |
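The pair counts above imply quite different class balances across the corpora, which matters when comparing F-scores between them. A small sketch of that arithmetic follows; the counts are copied from the table, while the `CORPORA` dictionary and `positive_ratio` helper are names introduced here for illustration.

```python
# (positive pairs, negative pairs) per corpus, copied from the corpus table.
CORPORA = {
    "AIMed": (1000, 4834),
    "BioInfer": (2534, 7132),
    "HPRD50": (163, 270),
    "IEPA": (335, 482),
    "LLL": (164, 166),
}


def positive_ratio(positive: int, negative: int) -> float:
    """Fraction of candidate pairs that are labeled as interacting."""
    return positive / (positive + negative)


for name, (pos, neg) in CORPORA.items():
    total = pos + neg
    print(f"{name}: {total} pairs, {positive_ratio(pos, neg):.1%} positive")
```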
Experimental results for the three kernels: feature-based (Kfea), DSTK (KDSTK), and composite (Kckl).
| Corpus | AIMed | | | BioInfer | | | HPRD50 | | | IEPA | | | LLL | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (%) | P | R | F | P | R | F | P | R | F | P | R | F | P | R | F |
| Kfea | 73.59 | 37.43 | 49.62 | 79.0 | 57.12 | 66.30 | 68.5 | 54.5 | 60.70 | 80.0 | 57.23 | 66.72 | 87.83 | 79.14 | 83.25 |
| KDSTK | 64.25 | 68.50 | 66.30 | 69.25 | 72.07 | 72.30 | 75.02 | 85.32 | 87.42 | ||||||
| Kckl | 76.25 | 75.85 | 87.31 | ||||||||||||
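The P, R, and F columns can be related with a short calculation. The sketch below assumes F is the balanced F-score (harmonic mean of precision and recall), an assumption that agrees with the Kfea values for AIMed and BioInfer in the table above; `f_score` is a helper name introduced here.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall.

    Inputs and output share the same scale (here, percentages).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Kfea precision/recall for AIMed and BioInfer, taken from the table above.
aimed_f = f_score(73.59, 37.43)
bioinfer_f = f_score(79.0, 57.12)

print(round(aimed_f, 2))     # 49.62, matching the reported F for Kfea on AIMed
print(round(bioinfer_f, 2))  # 66.3, matching the reported 66.30 on BioInfer
```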
Fig 4. ROC plots of our three kernels (feature-based, DSTK, composite) on five corpora: a) AIMed b) BioInfer c) HPRD50 d) IEPA e) LLL.
Comparison of our method with Li et al. [30] on the AIMed corpus.
| System | P | R | F |
|---|---|---|---|
| Li | 72.45 | 66.70 | 69.46 |
Comparison of our method with other kernel-based methods.
| Corpus | AIMed | BioInfer | HPRD50 | IEPA | LLL |
|---|---|---|---|---|---|
| Li et al[ | 69.7 | 74.0 | 78.0 | 76.5 | 87.3 |
| Miwa et al [ | 60.8 | 68.1 | 70.9 | 71.7 | 80.1 |
| Choi et al [ | 67.0 | 72.6 | 73.1 | 73.1 | 82.1 |
| Satre et al[ | 64.2 | 67.6 | 69.7 | 74.4 | 80.5 |
| Satre et al [ | 52.0 | ||||
| Miyao et al[ | 59.5 | | | | |
Comparison of our method with other non-kernel methods.
| Corpus | AIMed | BioInfer | HPRD50 | IEPA | LLL |
|---|---|---|---|---|---|
| | 56.12 | 61.26 | 71.28 | 74.19 | 80.99 |
| | 45.1 | - | 72.6 | 69.8 | 76.5 |
| | 63.5 | 65.3 | - | - | - |
Complex sentences extracted while annotating PPI.
| S.No | Corpus | Interaction sentences |
|---|---|---|
| 1 | BioInfer | Immunoprecipitation of metabolically labeled proteins with |
| 2 | HPRD | In addition, coexpression of |
| 3 | IEPA | The hydrophilic form of MDP released from the cells on stimulation with |