Antti Airola, Sampo Pyysalo, Jari Björne, Tapio Pahikkala, Filip Ginter, Tapio Salakoski.
Abstract
BACKGROUND: Automated extraction of protein-protein interactions (PPI) is an important and widely studied task in biomedical text mining. We propose a graph-kernel-based approach for this task. In contrast to earlier approaches to PPI extraction, the introduced all-paths graph kernel can make use of full, general dependency graphs representing the sentence structure.
Year: 2008 PMID: 19025688 PMCID: PMC2586751 DOI: 10.1186/1471-2105-9-S11-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Shortest path example. Stanford dependency parse ("collapsed" representation) in which the shortest path, shown in bold, excludes important words.
Figure 2. Graph representation. Graph representation generated from an example sentence. The candidate interaction pair is marked as PROT1 and PROT2; the third protein is marked as PROT. The shortest path between the proteins is shown in bold. In the dependency-based subgraph, all nodes on a shortest path are specialized using a post-tag (IP). In the linear-order subgraph, the possible tags are (B)efore, (M)iddle, and (A)fter. For the other two candidate pairs in the sentence, graphs with the same structure but different weights and labels would be generated.
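The linear-order tagging mentioned in the Figure 2 caption can be illustrated with a small sketch. It assumes that each token is tagged by its position relative to the two candidate proteins; the helper name `linear_order_tags` and the decision to leave the candidate entities themselves untagged are assumptions made for illustration, not details taken from the paper.

```python
def linear_order_tags(tokens, prot1_idx, prot2_idx):
    """Tag tokens as (B)efore, (M)iddle, or (A)fter relative to the candidate pair.

    Hypothetical helper: the two candidate entities get no tag (None),
    which is an assumption of this sketch.
    """
    first, second = sorted((prot1_idx, prot2_idx))
    tags = []
    for i in range(len(tokens)):
        if i in (first, second):
            tags.append(None)   # candidate protein itself
        elif i < first:
            tags.append("B")    # before the first candidate
        elif i < second:
            tags.append("M")    # between the two candidates
        else:
            tags.append("A")    # after the second candidate
    return tags


# Made-up example sentence with the candidate pair at positions 1 and 3.
tokens = ["Thus", "PROT1", "binds", "PROT2", "in", "vitro"]
tags = linear_order_tags(tokens, 1, 3)
print(list(zip(tokens, tags)))
```

For a different candidate pair in the same sentence, the same tokens would receive different tags, which is one way the per-pair graphs can share structure while differing in labels.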
Evaluation results. Counts of positive and negative examples in the corpora and (P)recision, (R)ecall, (F)-score and AUC for the graph kernel, with standard deviations provided for F and AUC.
| Corpus | #POS. | #NEG. | P | R | F | σ(F) | AUC | σ(AUC) | co-occ P | co-occ F |
| AIMed | 1000 | 4834 | 52.9% | 61.8% | 56.4% | 5.0% | 84.8% | 2.3% | 17.8% | 30.1% |
| BioInfer | 2534 | 7132 | 56.7% | 67.2% | 61.3% | 5.2% | 81.9% | 4.9% | 26.6% | 41.7% |
| HPRD50 | 163 | 270 | 64.3% | 65.8% | 63.4% | 11.4% | 79.7% | 6.3% | 38.9% | 55.4% |
| IEPA | 335 | 482 | 69.6% | 82.7% | 75.1% | 7.0% | 85.1% | 5.1% | 40.8% | 57.6% |
| LLL | 164 | 166 | 72.5% | 87.2% | 76.8% | 17.8% | 83.4% | 12.2% | 55.9% | 70.3% |
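As a reminder of what the AUC column measures, the following is a minimal sketch of AUC computed as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (ties count as one half). The scores and labels are made up for illustration, and the function name `auc_from_scores` is hypothetical; this is not the evaluation code used in the paper.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Made-up classifier outputs: higher score means "more likely an interaction".
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 1, 0]
auc_value = auc_from_scores(scores, labels)
print(round(auc_value, 3))
```

Unlike precision, recall, and F-score, this quantity depends only on the ordering of the scores, so it does not require choosing a classification threshold.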
Figure 3. Learning curves. Learning curves for the five corpora. The scale is logarithmic with respect to the amount of training data.
Cross-corpus results measured with AUC. AUC results for cross-corpus testing. Rows correspond to training corpora and columns to test corpora.
| Training corpus | AImed | rank | BioInfer | rank | HPRD50 | rank | IEPA | rank | LLL | rank | avg. rank |
| AImed | - | - | 67.7% | 2 | 82.4% | 1 | 76.1% | 2 | 77.8% | 3 | 2 |
| BioInfer | 77.8% | 1 | - | - | 75.2% | 3 | 79.3% | 1 | 83.3% | 1 | 1.5 |
| HPRD50 | 72.5% | 2 | 61.8% | 3 | - | - | 74.9% | 3 | 64.0% | 4 | 3 |
| IEPA | 70.2% | 3 | 72.2% | 1 | 80.0% | 2 | - | - | 82.5% | 2 | 2 |
| LLL | 61.8% | 4 | 61.0% | 4 | 69.4% | 4 | 74.8% | 4 | - | - | 4 |
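The rank columns in the table above can be reproduced from the AUC values alone. The sketch below ranks the training corpora within each test column (1 = highest AUC) and averages the ranks per training corpus; the variable and function names are illustrative, and the numbers are copied from the table.

```python
# AUC values (%) from the table: outer key = training corpus, inner key = test corpus.
corpora = ["AImed", "BioInfer", "HPRD50", "IEPA", "LLL"]
auc_table = {
    "AImed":    {"BioInfer": 67.7, "HPRD50": 82.4, "IEPA": 76.1, "LLL": 77.8},
    "BioInfer": {"AImed": 77.8, "HPRD50": 75.2, "IEPA": 79.3, "LLL": 83.3},
    "HPRD50":   {"AImed": 72.5, "BioInfer": 61.8, "IEPA": 74.9, "LLL": 64.0},
    "IEPA":     {"AImed": 70.2, "BioInfer": 72.2, "HPRD50": 80.0, "LLL": 82.5},
    "LLL":      {"AImed": 61.8, "BioInfer": 61.0, "HPRD50": 69.4, "IEPA": 74.8},
}


def average_ranks(table, names):
    """Rank training corpora per test corpus by AUC, then average over test corpora."""
    ranks = {train: [] for train in names}
    for test in names:
        trainers = [t for t in names if t != test]
        ordered = sorted(trainers, key=lambda t: table[t][test], reverse=True)
        for position, train in enumerate(ordered, start=1):
            ranks[train].append(position)
    return {train: sum(r) / len(r) for train, r in ranks.items()}


avg_rank = average_ranks(auc_table, corpora)
print(avg_rank)
# {'AImed': 2.0, 'BioInfer': 1.5, 'HPRD50': 3.0, 'IEPA': 2.0, 'LLL': 4.0}
```

The result matches the "avg. rank" column: BioInfer is the best training corpus by this measure and LLL the weakest.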
Cross-corpus results measured with F-score and threshold chosen on training set. F-score results for cross-corpus testing with the thresholds chosen on the training set. Rows correspond to training corpora and columns to test corpora. Δ denotes the difference between the F-score result and the result achieved with the optimal threshold.
| Training corpus | AImed | Δ | BioInfer | Δ | HPRD50 | Δ | IEPA | Δ | LLL | Δ |
| AImed | - | - | 24.9% | 22.2% | 64.6% | 4.4% | 22.9% | 44.5% | 17.7% | 56.8% |
| BioInfer | 44.2% | 3.0% | - | - | 63.6% | 0.3% | 64.5% | 3.5% | 76.4% | 1.6% |
| HPRD50 | 40.9% | 1.3% | 27.2% | 15.3% | - | - | 56.3% | 8.8% | 45.5% | 22.4% |
| IEPA | 38.4% | 0.7% | 47.0% | 4.7% | 65.6% | 1.9% | - | - | 77.0% | 0.6% |
| LLL | 32.6% | 0.7% | 42.2% | 0.3% | 58.3% | 1.5% | 63.9% | 1.0% | - | - |
Cross-corpus results measured with F-score and optimal thresholds. F-score results for cross-corpus testing with the optimal thresholds. Rows correspond to training corpora and columns to test corpora.
| Training corpus | AImed | BioInfer | HPRD50 | IEPA | LLL |
| AImed | - | 47.1% | 69.0% | 67.4% | 74.5% |
| BioInfer | 47.2% | - | 63.9% | 68.0% | 78.0% |
| HPRD50 | 42.2% | 42.5% | - | 65.1% | 67.9% |
| IEPA | 39.1% | 51.7% | 67.5% | - | 77.6% |
| LLL | 33.3% | 42.5% | 59.8% | 64.9% | - |
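An "optimal threshold" in this sense is the cut-off on the classifier output that maximizes F-score on the test data itself, which gives an upper bound on what threshold selection can achieve. A minimal sketch follows, assuming real-valued scores and binary labels; the data are made up and the function names are hypothetical.

```python
def f_score(labels, predictions):
    """Balanced F-score (harmonic mean of precision and recall)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try every observed score as a cut-off; predict positive when score >= cut-off."""
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        f = f_score(labels, [s >= t for s in scores])
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Made-up classifier outputs and gold labels.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 1, 0]
threshold, f_opt = best_threshold(scores, labels)
print(threshold, round(f_opt, 3))
```

The Δ columns in the neighbouring tables are then the gap between this optimal F-score and the F-score obtained with a threshold chosen some other way.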
Cross-corpus results measured with F-score and thresholds based on the distribution of the test set. F-score results for cross-corpus testing with the thresholds chosen according to the positive/negative distribution of the test set. Rows correspond to training corpora and columns to test corpora. Δ denotes the difference between the F-score result and the result achieved with the optimal threshold.
| Training corpus | AImed | Δ | BioInfer | Δ | HPRD50 | Δ | IEPA | Δ | LLL | Δ |
| AImed | - | - | 44.7% | 2.4% | 65.6% | 3.4% | 63.9% | 3.5% | 70.1% | 4.4% |
| BioInfer | 42.6% | 4.6% | - | - | 62.0% | 1.9% | 66.9% | 1.1% | 75.6% | 2.4% |
| HPRD50 | 39.1% | 3.1% | 40.0% | 2.5% | - | - | 63.3% | 1.8% | 58.5% | 9.4% |
| IEPA | 33.5% | 5.6% | 48.4% | 3.3% | 66.3% | 1.2% | - | - | 77.4% | 0.2% |
| LLL | 26.5% | 6.8% | 38.7% | 3.8% | 54.0% | 5.8% | 63.0% | 1.9% | - | - |
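One plausible reading of choosing the threshold "according to the positive/negative distribution of the test set" is to set the cut-off so that the fraction of examples predicted positive equals the known fraction of positives. The sketch below implements that reading; it is an assumption about the procedure, not a confirmed description of the authors' implementation, and the function name and data are hypothetical.

```python
def threshold_from_positive_fraction(scores, positive_fraction):
    """Return a cut-off so that about `positive_fraction` of examples score at or above it."""
    n_positive = round(positive_fraction * len(scores))
    if n_positive <= 0:
        return float("inf")  # predict nothing as positive
    ranked = sorted(scores, reverse=True)
    return ranked[min(n_positive, len(ranked)) - 1]


# Made-up classifier outputs; assume 60% of the test examples are known to be positive.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
threshold = threshold_from_positive_fraction(scores, 0.6)
predictions = [s >= threshold for s in scores]
print(threshold, predictions)
```

With tied scores the number of predicted positives can exceed the target count, so a real implementation would need a tie-breaking rule.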
Comparison on AImed. (P)recision, (R)ecall, (F)-score and AUC results for methods evaluated on AImed with the correct cross-validation methodology. Note that the best performing method, introduced by Miwa et al. [34], also utilizes the all-paths graph kernel.
| Method | P | R | F | AUC |
| Miwa et al. [34] | - | - | 63.5% | 87.9% |
| Miyao et al. | 54.9% | 65.5% | 59.5% | - |
| Giuliano et al. | 60.9% | 57.2% | 59.0% | - |
| All-paths graph kernel | 52.9% | 61.8% | 56.4% | 84.8% |
| Sætre et al. | 64.3% | 44.1% | 52.0% | - |
| Mitsumori et al. | 54.2% | 42.6% | 47.7% | - |
| Van Landeghem et al. | 49% | 44% | 46% | - |
| Yakushiji et al. | 33.7% | 33.1% | 33.4% | - |