Domonkos Tikk, Illés Solt, Philippe Thomas, Ulf Leser.
Abstract
BACKGROUND: Kernel-based classification is the current state-of-the-art for extracting pairs of interacting proteins (PPIs) from free text. Various proposals have been put forward, which diverge especially in the specific kernel function, the type of input representation, and the feature sets. These proposals are regularly compared to each other regarding their overall performance on different gold standard corpora, but little is known about their respective performance on the instance level.
Year: 2013 PMID: 23323857 PMCID: PMC3680070 DOI: 10.1186/1471-2105-14-12
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The distribution of pairs according to classification success level using cross-validation setting. The distribution of pairs (total, positive and negative) in terms of the number of kernels that classify them correctly (success level) aggregated across the 5 corpora in cross-validation setting. Detailed data for each corpus can be found in Table 1. All 13 kernels are taken into consideration.
Figure 2. The distribution of pairs according to classification success level using cross-learning setting. The distribution of pairs (total, positive and negative) in terms of the number of kernels that classify them correctly (success level) aggregated across the 5 corpora in cross-learning setting. Detailed data for each corpus can be found in Table 2. All kernels except for the very slow PT kernel are taken into consideration.
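Table 1. The distribution of pairs for each corpus according to classification success level using cross-validation setting
| Success level | AIMed total | AIMed pos. | AIMed neg. | AIMed pos. (%) | AIMed neg. (%) | BioInfer total | BioInfer pos. | BioInfer neg. | BioInfer pos. (%) | BioInfer neg. (%) | HPRD50 total | HPRD50 pos. | HPRD50 neg. | HPRD50 pos. (%) | HPRD50 neg. (%) | IEPA total | IEPA pos. | IEPA neg. | IEPA pos. (%) | IEPA neg. (%) | LLL total | LLL pos. | LLL neg. | LLL pos. (%) | LLL neg. (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|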
| 0 | 77 | 73 | 4 | 7.3% | 0.1% | 58 | 44 | 14 | 1.7% | 0.2% | 4 | 1 | 3 | 0.6% | 1.1% | 2 | 1 | 1 | 0.3% | 0.2% | 5 | 0 | 5 | 0.0% | 3.0% |
| 1 | 95 | 89 | 6 | 8.9% | 0.1% | 158 | 107 | 51 | 4.2% | 0.7% | 7 | 4 | 3 | 2.5% | 1.1% | 13 | 5 | 8 | 1.5% | 1.7% | 7 | 0 | 7 | 0.0% | 4.2% |
| 2 | 105 | 101 | 4 | 10.1% | 0.1% | 206 | 130 | 76 | 5.1% | 1.1% | 12 | 8 | 4 | 4.9% | 1.5% | 11 | 3 | 8 | 0.9% | 1.7% | 27 | 0 | 27 | 0.0% | 16.3% |
| 3 | 121 | 104 | 17 | 10.4% | 0.4% | 306 | 198 | 108 | 7.8% | 1.5% | 18 | 7 | 11 | 4.3% | 4.1% | 26 | 13 | 13 | 3.9% | 2.7% | 10 | 0 | 10 | 0.0% | 6.0% |
| 4 | 139 | 115 | 24 | 11.5% | 0.5% | 349 | 203 | 146 | 8.0% | 2.0% | 26 | 10 | 16 | 6.1% | 5.9% | 30 | 10 | 20 | 3.0% | 4.1% | 16 | 0 | 16 | 0.0% | 9.6% |
| 5 | 140 | 91 | 49 | 9.1% | 1.0% | 440 | 225 | 215 | 8.9% | 3.0% | 20 | 12 | 8 | 7.4% | 3.0% | 43 | 19 | 24 | 5.7% | 5.0% | 21 | 2 | 19 | 1.2% | 11.4% |
| 6 | 142 | 70 | 72 | 7.0% | 1.5% | 481 | 209 | 272 | 8.2% | 3.8% | 33 | 9 | 24 | 5.5% | 8.9% | 61 | 22 | 39 | 6.6% | 8.1% | 26 | 1 | 25 | 0.6% | 15.1% |
| 7 | 176 | 65 | 111 | 6.5% | 2.3% | 619 | 248 | 371 | 9.8% | 5.2% | 35 | 15 | 20 | 9.2% | 7.4% | 51 | 20 | 31 | 6.0% | 6.4% | 29 | 8 | 21 | 4.9% | 12.7% |
| 8 | 248 | 72 | 176 | 7.2% | 3.6% | 785 | 256 | 529 | 10.1% | 7.4% | 37 | 9 | 28 | 5.5% | 10.4% | 79 | 31 | 48 | 9.3% | 10.0% | 19 | 6 | 13 | 3.7% | 7.8% |
| 9 | 372 | 69 | 303 | 6.9% | 6.3% | 876 | 245 | 631 | 9.7% | 8.8% | 46 | 10 | 36 | 6.1% | 13.3% | 99 | 32 | 67 | 9.6% | 13.9% | 26 | 15 | 11 | 9.1% | 6.6% |
| 10 | 461 | 47 | 414 | 4.7% | 8.6% | 1067 | 204 | 863 | 8.1% | 12.1% | 61 | 33 | 28 | 20.2% | 10.4% | 101 | 38 | 63 | 11.3% | 13.1% | 31 | 19 | 12 | 11.6% | 7.2% |
| 11 | 619 | 29 | 590 | 2.9% | 12.2% | 1061 | 164 | 897 | 6.5% | 12.6% | 49 | 19 | 30 | 11.7% | 11.1% | 112 | 46 | 66 | 13.7% | 13.7% | 32 | 32 | 0 | 19.5% | 0.0% |
| 12 | 1002 | 43 | 959 | 4.3% | 19.8% | 1390 | 183 | 1207 | 7.2% | 16.9% | 57 | 13 | 44 | 8.0% | 16.3% | 106 | 47 | 59 | 14.0% | 12.2% | 45 | 45 | 0 | 27.4% | 0.0% |
| 13 | 2137 | 32 | 2105 | 3.2% | 43.5% | 1870 | 118 | 1752 | 4.7% | 24.6% | 28 | 13 | 15 | 8.0% | 5.6% | 83 | 48 | 35 | 14.3% | 7.3% | 36 | 36 | 0 | 22.0% | 0.0% |
The distribution of pairs (total, positive and negative) in terms of the number of kernels that classify them correctly. Results shown for each corpus separately. Aggregated results are shown in Figure 1. All the 13 kernels are taken into consideration.
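The success level of a pair is the number of kernels that classify it correctly. A minimal sketch of how such a distribution could be tabulated is given below; the function names and the toy predictions are hypothetical and are not taken from the study's data or code.

```python
from collections import Counter


def success_levels(gold, predictions):
    """For each pair, count how many classifiers predicted its gold label."""
    return [
        sum(1 for kernel_preds in predictions.values() if kernel_preds[i] == label)
        for i, label in enumerate(gold)
    ]


def level_distribution(gold, levels, n_kernels):
    """Map each success level to (total, positive, negative) pair counts."""
    pos = Counter(lv for lv, g in zip(levels, gold) if g)
    neg = Counter(lv for lv, g in zip(levels, gold) if not g)
    return {lv: (pos[lv] + neg[lv], pos[lv], neg[lv]) for lv in range(n_kernels + 1)}


# Toy example (hypothetical): 4 pairs, 3 classifiers.
gold = [True, True, False, False]
predictions = {
    "k1": [True, False, False, True],
    "k2": [True, False, False, False],
    "k3": [True, True, True, False],
}
levels = success_levels(gold, predictions)
table = level_distribution(gold, levels, n_kernels=len(predictions))
for level, (total, positive, negative) in table.items():
    print(level, total, positive, negative)
```

Each row of the printed table corresponds to one success level, in the same total/positive/negative layout as the per-corpus tables.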
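Table 2. The distribution of pairs for each corpus according to classification success level using cross-learning setting
| Success level | AIMed total | AIMed pos. | AIMed neg. | AIMed pos. (%) | AIMed neg. (%) | BioInfer total | BioInfer pos. | BioInfer neg. | BioInfer pos. (%) | BioInfer neg. (%) | HPRD50 total | HPRD50 pos. | HPRD50 neg. | HPRD50 pos. (%) | HPRD50 neg. (%) | IEPA total | IEPA pos. | IEPA neg. | IEPA pos. (%) | IEPA neg. (%) | LLL total | LLL pos. | LLL neg. | LLL pos. (%) | LLL neg. (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|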
| 0 | 41 | 0 | 41 | 0.0% | 0.8% | 319 | 319 | 0 | 12.6% | 0.0% | 1 | 0 | 1 | 0.0% | 0.4% | 9 | 9 | 0 | 2.7% | 0.0% | 3 | 3 | 0 | 1.8% | 0.0% |
| 1 | 73 | 6 | 67 | 0.6% | 1.4% | 362 | 362 | 0 | 14.3% | 0.0% | 4 | 2 | 2 | 1.2% | 0.7% | 19 | 17 | 2 | 5.1% | 0.4% | 5 | 4 | 1 | 2.4% | 0.6% |
| 2 | 199 | 26 | 173 | 2.6% | 3.6% | 322 | 312 | 10 | 12.3% | 0.1% | 7 | 3 | 4 | 1.8% | 1.5% | 33 | 32 | 1 | 9.6% | 0.2% | 10 | 9 | 1 | 5.5% | 0.6% |
| 3 | 315 | 39 | 276 | 3.9% | 5.7% | 303 | 280 | 23 | 11.0% | 0.3% | 23 | 10 | 13 | 6.1% | 4.8% | 38 | 36 | 2 | 10.7% | 0.4% | 19 | 19 | 0 | 11.6% | 0.0% |
| 4 | 489 | 71 | 418 | 7.1% | 8.6% | 321 | 260 | 61 | 10.3% | 0.9% | 27 | 15 | 12 | 9.2% | 4.4% | 48 | 45 | 3 | 13.4% | 0.6% | 25 | 25 | 0 | 15.2% | 0.0% |
| 5 | 606 | 84 | 522 | 8.4% | 10.8% | 355 | 239 | 116 | 9.4% | 1.6% | 27 | 15 | 12 | 9.2% | 4.4% | 44 | 32 | 12 | 9.6% | 2.5% | 25 | 20 | 5 | 12.2% | 3.0% |
| 6 | 547 | 94 | 453 | 9.4% | 9.4% | 400 | 208 | 192 | 8.2% | 2.7% | 41 | 22 | 19 | 13.5% | 7.0% | 51 | 34 | 17 | 10.1% | 3.5% | 26 | 18 | 8 | 11.0% | 4.8% |
| 7 | 725 | 136 | 589 | 13.6% | 12.2% | 432 | 190 | 242 | 7.5% | 3.4% | 43 | 18 | 25 | 11.0% | 9.3% | 63 | 32 | 31 | 9.6% | 6.4% | 20 | 7 | 13 | 4.3% | 7.8% |
| 8 | 721 | 132 | 589 | 13.2% | 12.2% | 586 | 146 | 440 | 5.8% | 6.2% | 52 | 17 | 35 | 10.4% | 13.0% | 69 | 35 | 34 | 10.4% | 7.1% | 34 | 18 | 16 | 11.0% | 9.6% |
| 9 | 767 | 110 | 657 | 11.0% | 13.6% | 737 | 95 | 642 | 3.7% | 9.0% | 61 | 18 | 43 | 11.0% | 15.9% | 107 | 36 | 71 | 10.7% | 14.7% | 34 | 19 | 15 | 11.6% | 9.0% |
| 10 | 574 | 118 | 456 | 11.8% | 9.4% | 1060 | 79 | 981 | 3.1% | 13.8% | 50 | 14 | 36 | 8.6% | 13.3% | 110 | 13 | 97 | 3.9% | 20.1% | 56 | 8 | 48 | 4.9% | 28.9% |
| 11 | 414 | 69 | 345 | 6.9% | 7.1% | 1906 | 29 | 1877 | 1.1% | 26.3% | 52 | 16 | 36 | 9.8% | 13.3% | 131 | 6 | 125 | 1.8% | 25.9% | 50 | 12 | 38 | 7.3% | 22.9% |
| 12 | 363 | 115 | 248 | 11.5% | 5.1% | 2563 | 15 | 2548 | 0.6% | 35.7% | 45 | 13 | 32 | 8.0% | 11.9% | 95 | 8 | 87 | 2.4% | 18.0% | 23 | 2 | 21 | 1.2% | 12.7% |
The distribution of pairs (total, positive and negative) in terms of the number of kernels that classify them correctly. Results are shown for each corpus separately. Aggregated results are shown in Figure 2. All kernels except PT are considered (PT is extremely slow and provides below-average results).
Figure 3. Heatmap of success level correlation in CV and CL evaluations. Correlation ranges from 2 (cyan) through 63 (white) to 1266 (magenta) pairs. Hues are on a logarithmic scale.
The overlap of the pairs that are the most difficult and the easiest to classify correctly by the collection of kernels using cross-validation (CV) and cross-learning (CL) settings
| Difficulty | Class label (GT) | Set | AIMed | BioInfer | HPRD50 | IEPA | LLL | Total | % |
|---|---|---|---|---|---|---|---|---|---|
| difficult | unknown | D CV | 537 | 1077 | 41 | 82 | 39 | 1776 | 10.4 |
| | | D CL | 628 | 1003 | 35 | 99 | 37 | 1802 | 10.6 |
| | | D = | 105 | 530 | 8 | 28 | 0 | 671 | 3.9 |
| | | p-value | $10^{-10}$ | $10^{-281}$ | $10^{-8}$ | | | | |
| | positive | PD CV | 162 | 281 | 20 | 32 | 17 | 512 | 12.2 |
| | | PD CL | 142 | 319 | 15 | 26 | 16 | 518 | 12.3 |
| | | PD = | 61 | 111 | 2 | 9 | 7 | 190 | 4.5 |
| | | p-value | $10^{-60}$ | $10^{-95}$ | $10^{-7}$ | $10^{-6}$ | | | |
| | negative | ND CV | 463 | 610 | 37 | 50 | 39 | 1199 | 9.3 |
| | | ND CL | 557 | 644 | 32 | 37 | 28 | 1298 | 10.1 |
| | | ND = | 184 | 295 | 12 | 19 | 11 | 521 | 4.0 |
| | | p-value | $10^{-76}$ | $10^{-204}$ | $10^{-6}$ | $10^{-15}$ | $10^{-4}$ | | |
| easy | unknown | E CV | 2137 | 1870 | 85 | 83 | 36 | 4211 | 24.7 |
| | | E CL | 777 | 2563 | 45 | 95 | 73 | 3558 | 20.8 |
| | | E = | 464 | 1017 | 23 | 20 | 4 | 1528 | 8.9 |
| | | p-value | $10^{-45}$ | $10^{-184}$ | $10^{-7}$ | $10^{-3}$ | | | |
| | positive | PE CV | 104 | 301 | 26 | 48 | 36 | 515 | 12.3 |
| | | PE CL | 115 | 364 | 29 | 27 | 22 | 557 | 13.3 |
| | | PE = | 49 | 147 | 6 | 10 | 7 | 219 | 5.2 |
| | | p-value | $10^{-59}$ | $10^{-136}$ | $10^{-7}$ | | | | |
| | negative | NE CV | 2105 | 1752 | 59 | 94 | 23 | 4033 | 31.3 |
| | | NE CL | 593 | 2548 | 32 | 87 | 21 | 3281 | 25.5 |
| | | NE = | 440 | 1014 | 21 | 27 | 8 | 1510 | 11.7 |
| | | p-value | $10^{-88}$ | $10^{-215}$ | $10^{-12}$ | $10^{-7}$ | $10^{-5}$ | | |
We also indicate the size of each set, because the set sizes vary with the size of the success level classes. Abbreviations D, E, PD, ND, PE, and NE refer to the set of difficult (unknown class label), easy (unknown class label), positive difficult, negative difficult, positive easy and negative easy pairs, respectively; GT means ground truth. We highlighted in bold the number of pairs in the intersection of the CV and CL settings. We show the p-value of Fisher’s independence χ²-test rounded to the nearest power of 10. Bold typesetting indicates that the size of the overlap is too low.
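To illustrate how the significance of a CV/CL overlap can be assessed, the sketch below computes a one-sided Fisher exact test (a hypergeometric tail probability). This is a stand-in chosen for illustration, not necessarily the exact test variant the authors ran, so its value need not match the tabulated p-values. The function name is hypothetical. The example uses the first corpus column of the D rows (537, 628 and 105 pairs), and the corpus size of 5834 is an assumption obtained by summing the first "total" column of the cross-validation table.

```python
from math import comb


def overlap_p_value(n_total, n_cv, n_cl, n_both):
    """One-sided Fisher exact test for set overlap.

    Probability that two sets of sizes n_cv and n_cl, drawn independently
    from n_total pairs, share at least n_both pairs.
    """
    upper = min(n_cv, n_cl)
    numerator = sum(
        comb(n_cv, k) * comb(n_total - n_cv, n_cl - k)
        for k in range(n_both, upper + 1)
    )
    # Exact integer arithmetic; converted to float only in the final division.
    return numerator / comb(n_total, n_cl)


# Difficult pairs, first corpus column: 537 (CV), 628 (CL), 105 in both.
p_aimed = overlap_p_value(n_total=5834, n_cv=537, n_cl=628, n_both=105)
print(f"p = {p_aimed:.1e}")
```

Under independence the expected overlap here would be about 537·628/5834 ≈ 58 pairs, so an observed overlap of 105 yields a very small tail probability.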
Classification results on the 521 ND pairs with CV evaluation
| Kernel | r (%) | TN | e | TN/e | TN/521 |
|---|---|---|---|---|---|
| edit | 18.1 | 305 | 427 | 0.71 | 0.59 |
| lexical | 25.0 | 203 | 391 | 0.52 | 0.39 |
| SST | 26.6 | 186 | 382 | 0.49 | 0.36 |
| APG | 25.3 | 185 | 389 | 0.48 | 0.36 |
| PT | 27.9 | 185 | 376 | 0.49 | 0.36 |
| syntactic | 24.4 | 180 | 394 | 0.46 | 0.35 |
| cosine | 24.9 | 168 | 391 | 0.43 | 0.32 |
| ST | 28.0 | 160 | 375 | 0.43 | 0.30 |
| shallow | 24.6 | 136 | 393 | 0.35 | 0.26 |
| kBSPS | 36.6 | 122 | 330 | 0.37 | 0.23 |
| combined | 24.8 | 117 | 392 | 0.30 | 0.22 |
| SL | 30.4 | 116 | 363 | 0.32 | 0.22 |
| SpT | 46.4 | 88 | 279 | 0.32 | 0.17 |
Classification results on the 521 ND pairs with CV evaluation (in decreasing order according to the number of successfully classified pairs). Ratio (r) refers to the distribution of positive classes predicted by the kernel measured across the 5 corpora; TN is the number of correctly classified ND pairs; e is 521·(1−r), the expected number of negative class predictions projected onto the 521 ND pairs.
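As a worked example of the quantities defined in this caption, the sketch below recomputes e for the edit kernel row (r = 18.1%, TN = 305) and the two ratios that the last two columns appear to contain, namely TN/e and TN/521. That reading of the columns is an inference from the numbers, and the function name is hypothetical.

```python
def chance_baseline(n_pairs, positive_ratio, n_correct, gold_is_positive):
    """Compare a kernel's hits on a pair set with its overall prediction bias.

    positive_ratio is r, the fraction of positive predictions of the kernel.
    For negative pairs the expected number of correct predictions is
    n_pairs * (1 - r); for positive pairs it is n_pairs * r.
    """
    p_correct = positive_ratio if gold_is_positive else 1.0 - positive_ratio
    expected = n_pairs * p_correct
    return expected, n_correct / expected, n_correct / n_pairs


# Edit kernel on the 521 ND pairs with CV evaluation.
expected, vs_expected, accuracy = chance_baseline(521, 0.181, 305, gold_is_positive=False)
print(round(expected), round(vs_expected, 2), round(accuracy, 2))  # 427 0.71 0.59
```

A ratio below 1 in the TN/e column means the kernel labels these pairs as negative less often than its overall prediction bias would suggest.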
Classification results on the 521 ND pairs with CL evaluation
| Kernel | r (%) | TN | e | TN/e | TN/521 |
|---|---|---|---|---|---|
| SST | 26.9 | 288 | 381 | 0.76 | 0.55 |
| edit | 22.5 | 279 | 404 | 0.69 | 0.54 |
| ST | 29.2 | 231 | 369 | 0.63 | 0.44 |
| APG | 26.9 | 207 | 381 | 0.54 | 0.40 |
| SL | 29.9 | 177 | 365 | 0.48 | 0.34 |
| lexical | 24.5 | 170 | 393 | 0.43 | 0.33 |
| cosine | 26.6 | 157 | 382 | 0.41 | 0.30 |
| syntactic | 26.9 | 155 | 381 | 0.41 | 0.30 |
| SpT | 42.1 | 142 | 302 | 0.47 | 0.27 |
| combined | 26.8 | 132 | 381 | 0.35 | 0.25 |
| shallow | 28.6 | 127 | 372 | 0.34 | 0.24 |
| kBSPS | 37.1 | 120 | 328 | 0.37 | 0.23 |
Classification results on the 521 ND pairs with CL evaluation (in decreasing order according to the number of successfully classified pairs). Ratio (r) refers to the distribution of positive classes predicted by the kernel measured across the 5 corpora; TN is the number of correctly classified ND pairs; e is 521·(1−r), the expected number of negative class predictions projected onto the 521 ND pairs.
Classification results on the 190 PD pairs with CV evaluation
| Kernel | r (%) | TP | e | TP/e | TP/190 |
|---|---|---|---|---|---|
| SpT | 46.4 | 71 | 88 | 0.81 | 0.37 |
| PT | 27.9 | 33 | 53 | 0.62 | 0.17 |
| kBSPS | 36.6 | 22 | 70 | 0.31 | 0.12 |
| ST | 28.0 | 19 | 53 | 0.36 | 0.10 |
| SST | 26.6 | 16 | 51 | 0.31 | 0.08 |
| APG | 25.3 | 15 | 48 | 0.31 | 0.08 |
| SL | 30.4 | 15 | 58 | 0.26 | 0.08 |
| syntactic | 24.4 | 14 | 46 | 0.30 | 0.07 |
| edit | 18.1 | 11 | 34 | 0.32 | 0.06 |
| lexical | 25.0 | 9 | 47 | 0.19 | 0.05 |
| shallow | 24.6 | 7 | 47 | 0.15 | 0.04 |
| cosine | 24.9 | 7 | 47 | 0.15 | 0.04 |
| combined | 24.8 | 4 | 47 | 0.09 | 0.02 |
Classification results on the 190 PD pairs with CV evaluation (in decreasing order according to the number of successfully classified pairs). Ratio (r) refers to the distribution of positive classes predicted by the kernel measured across the 5 corpora; TP is the number of correctly classified PD pairs; e is 190·r, the expected number of positive class predictions projected onto the 190 PD pairs.
Classification results on the 190 PD pairs with CL evaluation
| Kernel | r (%) | TP | e | TP/e | TP/190 |
|---|---|---|---|---|---|
| SpT | 42.1 | 53 | 80 | 0.66 | 0.28 |
| SST | 26.9 | 39 | 51 | 0.76 | 0.21 |
| ST | 29.2 | 28 | 55 | 0.51 | 0.15 |
| SL | 29.9 | 27 | 57 | 0.47 | 0.14 |
| combined | 26.8 | 16 | 51 | 0.31 | 0.08 |
| shallow | 28.6 | 14 | 54 | 0.26 | 0.07 |
| kBSPS | 37.1 | 14 | 70 | 0.20 | 0.07 |
| APG | 26.9 | 9 | 51 | 0.18 | 0.05 |
| edit | 22.5 | 7 | 43 | 0.16 | 0.04 |
| cosine | 26.6 | 4 | 51 | 0.08 | 0.02 |
| syntactic | 26.9 | 2 | 51 | 0.04 | 0.01 |
| lexical | 24.5 | 1 | 47 | 0.02 | 0.01 |
Classification results on the 190 PD pairs with CL evaluation (in decreasing order according to the number of successfully classified pairs). Ratio (r) refers to the distribution of positive classes predicted by the kernel measured across the 5 corpora; TP is the number of correctly classified PD pairs; e is 190·r, the expected number of positive class predictions projected onto the 190 PD pairs.
Classification results on the 1510 NE pairs with CV evaluation
| Kernel | r (%) | TN | FN | e |
|---|---|---|---|---|
| APG | 25.3 | 1510 | 0 | 1129 |
| cosine | 24.9 | 1510 | 0 | 1134 |
| edit | 18.1 | 1510 | 0 | 1237 |
| combined | 24.8 | 1510 | 0 | 1135 |
| shallow | 24.6 | 1510 | 0 | 1138 |
| syntactic | 24.4 | 1510 | 0 | 1142 |
| kBSPS | 36.6 | 1509 | 1 | 957 |
| SL | 30.4 | 1508 | 2 | 1051 |
| lexical | 25.0 | 1506 | 4 | 1133 |
| PT | 27.9 | 1505 | 5 | 1089 |
| ST | 28.0 | 1502 | 8 | 1088 |
| SST | 26.6 | 1501 | 9 | 1108 |
| SpT | 46.4 | 1484 | 26 | 810 |
Classification results on the 1510 NE pairs with CV evaluation (in decreasing order according to the number of successfully classified pairs). Ratio (r) refers to the distribution of positive classes predicted by the kernel measured across the 5 corpora; TN/FN is the number of correctly/incorrectly classified NE pairs; e is 1510·(1−r), the expected number of negative class predictions projected onto the 1510 NE pairs.
Classification results on the 1510 NE pairs with CL evaluation
| Kernel | r (%) | TN | FN | e |
|---|---|---|---|---|
| shallow | 28.6 | 1510 | 0 | 1078 |
| combined | 26.8 | 1505 | 5 | 1105 |
| APG | 26.9 | 1504 | 6 | 1104 |
| SL | 29.9 | 1504 | 6 | 1059 |
| lexical | 24.5 | 1501 | 9 | 1140 |
| kBSPS | 37.1 | 1494 | 16 | 950 |
| edit | 22.5 | 1491 | 19 | 1171 |
| cosine | 26.6 | 1490 | 20 | 1109 |
| ST | 29.2 | 1489 | 21 | 1069 |
| SST | 26.9 | 1484 | 26 | 1104 |
| syntactic | 26.9 | 1483 | 27 | 1103 |
| SpT | 42.1 | 1429 | 81 | 874 |
Classification results on the 1510 NE pairs with CL evaluation (in decreasing order according to the number of successfully classified pairs). Ratio (r) refers to the distribution of positive classes predicted by the kernel measured across the 5 corpora; TN/FN is the number of correctly/incorrectly classified NE pairs; e is 1510·(1−r), the expected number of negative class predictions projected onto the 1510 NE pairs.
Classification results on the 219 PE pairs with CV evaluation
| Kernel | r (%) | TP | FP | e |
|---|---|---|---|---|
| combined | 24.8 | 218 | 1 | 54 |
| APG | 25.3 | 218 | 1 | 55 |
| SpT | 46.4 | 218 | 1 | 102 |
| kBSPS | 36.6 | 217 | 2 | 80 |
| SL | 30.4 | 216 | 3 | 67 |
| shallow | 24.6 | 213 | 6 | 54 |
| PT | 27.9 | 210 | 9 | 61 |
| syntactic | 24.4 | 208 | 11 | 53 |
| cosine | 24.9 | 206 | 13 | 55 |
| ST | 28.0 | 205 | 14 | 61 |
| lexical | 25.0 | 204 | 15 | 55 |
| SST | 26.6 | 201 | 18 | 58 |
| edit | 18.1 | 192 | 27 | 40 |
Classification results on the 219 PE pairs with CV evaluation (in decreasing order according to the number of successfully classified pairs). Ratio (r) refers to the distribution of positive classes predicted by the kernel measured across the 5 corpora; TP/FP is the number of correctly/incorrectly classified PE pairs; e is 219·r, the expected number of positive class predictions projected onto the 219 PE pairs.
Classification results on the 219 PE pairs with CL evaluation
| Kernel | r (%) | TP | FP | e |
|---|---|---|---|---|
| kBSPS | 37.1 | 218 | 1 | 81 |
| combined | 26.8 | 217 | 2 | 59 |
| shallow | 28.6 | 205 | 14 | 63 |
| SL | 29.9 | 202 | 17 | 65 |
| syntactic | 26.9 | 202 | 17 | 59 |
| lexical | 24.5 | 196 | 23 | 54 |
| APG | 26.9 | 194 | 25 | 59 |
| cosine | 26.6 | 181 | 38 | 58 |
| SpT | 42.1 | 177 | 42 | 92 |
| edit | 22.5 | 154 | 65 | 49 |
| ST | 29.2 | 126 | 93 | 64 |
| SST | 26.9 | 123 | 96 | 59 |
Classification results on the 219 PE pairs with CL evaluation (in decreasing order according to the number of successfully classified pairs). Ratio (r) refers to the distribution of positive classes predicted by the kernel measured across the 5 corpora; TP/FP is the number of correctly/incorrectly classified PE pairs; e is 219·r, the expected number of positive class predictions projected onto the 219 PE pairs.
Figure 4. Characteristics of pairs by difficulty class (average sentence length in words, average word distance between entities, average distance in the dependency graph (DG) and syntax tree (ST) shortest path). ND – negative difficult, NN – negative neutral, NE – negative easy, PD – positive difficult, PN – positive neutral, PE – positive easy.
Figure 5. The number of positive and negative pairs vs. the length of the sentence containing the pair.
Figure 6. The positive ground truth rate vs. the length of the sentence containing the pair.
Figure 7. Class distribution of pairs depending on the number of proteins in the sentence.
Classification of difficulty classes based on pair surface features by decision tree
| Class | Precision (%) | Recall (%) | F-score (%) | Predicted D | Predicted N | Predicted E | Total |
|---|---|---|---|---|---|---|---|
| difficult (D) | 43.5 | 20.8 | 28.2 | 148 | 543 | 20 | 711 |
| neutral (N) | 92.0 | 96.2 | 94.1 | 178 | 14090 | 372 | 14640 |
| easy (E) | 72.6 | 60.0 | 65.7 | 14 | 678 | 1037 | 1729 |
| Total | 88.0 | 89.4 | 88.5 | | | | |
Classification by the Weka J48 classifier. Confusion matrix columns correspond to predicted classes.
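The per-class scores in this table can be recomputed from the confusion matrix. The sketch below does so in plain Python, reading the first three numeric columns as precision, recall and F-score; that reading is an inference from the numbers, and the helper name is hypothetical.

```python
def per_class_scores(confusion, labels):
    """Precision, recall and F-score per class.

    confusion[i][j] is the number of pairs of true class i predicted as class j,
    so rows are true classes and columns are predicted classes.
    """
    scores = {}
    for i, label in enumerate(labels):
        true_positive = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column sum
        actual = sum(confusion[i])                    # row sum
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f_score)
    return scores


# Confusion matrix from the table (rows: true D, N, E; columns: predicted D, N, E).
confusion = [
    [148, 543, 20],
    [178, 14090, 372],
    [14, 678, 1037],
]
scores = per_class_scores(confusion, ["D", "N", "E"])
for label, (p, r, f) in scores.items():
    print(label, round(100 * p, 1), round(100 * r, 1), round(100 * f, 1))
```

For the difficult class this gives a precision of 148/340 and a recall of 148/711, which round to the 43.5 and 20.8 reported in the table.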
Incorrectly annotated protein pairs selected from the very hardest positive and negative pairs
| Pair ID | GT | Type of error | Sentence |
|---|---|---|---|
| B.d267.s0.p14 | T | indirect | However, a number of mammalian DNA repair proteins lack NLS clusters; these proteins include ERCC1, ERCC2 (XPD), mouse RAD51, and the |
| B.d418.s0.p0 | T | functional | Membranous staining and concomitant cytoplasmic localization of E-cadherin, |
| B.d418.s0.p1 | T | functional | Membranous staining and concomitant cytoplasmic localization of |
| B.d506.s0.p8 | T | enumeration | Quantitation of the appearance of X22 banding in primary cultures of myotubes indicates that it precedes that of other myofibrillar proteins and that assembly takes place in the following order: |
| B.d833.s0.p15 | T | functional | Within 1 hour of raising the concentration of calcium ions, integrins, cadherins, alpha-catenin, beta-catenin, plakoglobin, vinculin and alpha-actinin appeared to accumulate at cell-cell borders, whereas the focal contact proteins, |
| B.d833.s0.p14 | T | functional | Within 1 hour of raising the concentration of calcium ions, |
| B.d594.s0.p0 | T | functional | The clone contains an open reading frame of 139 amino acid residues which shows greater than 40% sequence identity in a 91 amino acid overlap to animal actin-depolymerizing factors ( |
| B.d296.s2.p20 | T | functional | In normal livers, E-cad, |
| B.d296.s2.p25 | T | functional | In normal livers, E-cad, |
| B.d541.s0.p0 | T | functional | Since both |
| B.d546.s0.p20 | T | functional | Specific antibodies to |
| A.d28.s234.p1 | T | coreference | We have identified a new TNF-related ligand, designated human |
| B.d765.s0.p14 | T | enumeration | To determine the relationship between cell cycle regulation and differentiation, the spatiotemporal expression of cyclin A, cyclin B1, cyclin D1, the |
| B.d296.s2.p23 | T | functional | In normal livers, |
| B.d267.s0.p18 | T | indirect | However, a number of mammalian DNA repair proteins lack NLS clusters; these proteins include ERCC1, ERCC2 (XPD), mouse RAD51, and the HHR23B/ |
| B.d833.s0.p35 | T | functional | Within 1 hour of raising the concentration of calcium ions, |
| B.d765.s0.p10 | T | enumeration | To determine the relationship between cell cycle regulation and differentiation, the spatiotemporal expression of cyclin A, cyclin B1, cyclin D1, the cyclin-dependent kinase inhibitors ( |
| B.d833.s0.p34 | T | functional | Within 1 hour of raising the concentration of calcium ions, |
| B.d506.s0.p4 | T | enumeration | Quantitation of the appearance of X22 banding in primary cultures of myotubes indicates that it precedes that of other myofibrillar proteins and that assembly takes place in the following order: X22, titin, |
| B.d833.s0.p7 | T | functional | Within 1 hour of raising the concentration of calcium ions, |
| B.d506.s0.p11 | T | enumeration | Quantitation of the appearance of X22 banding in primary cultures of myotubes indicates that it precedes that of other myofibrillar proteins and that assembly takes place in the following order: X22, |
| B.d833.s0.p29 | T | functional | Within 1 hour of raising the concentration of calcium ions, |
| B.d833.s0.p32 | T | functional | Within 1 hour of raising the concentration of calcium ions, |
| A.d60.s528.p0 | F | T | The |
| B.d180.s0.p0 | F | T | |
| A.d114.s961.p0 | F | T | |
| B.d93.s0.p9 | F | T | Because |
| B.d749.s0.p2 | F | T | Three actin-associated proteins, actin-binding protein, |
| B.d639.s0.p0 | F | T | The main inhibitory action of p27, a cyclin-dependent kinase inhibitor ( |
| B.d334.s0.p0 | F | T | In extracts from mouse brain, |
| A.d141.s1189.p0 | F | T | The cyclin-dependent kinase |
| B.d485.s0.p2 | F | T | PF4-dependent downregulation of cyclin E-cdk2 activity was associated with increased binding of the |
| A.d157.s1329.p4 | F | T | Deletion analysis and binding studies demonstrate that a third enzyme, protein kinase C ( |
| A.d60.s529.p0 | F | T | Furthermore, a bacterially expressed |
| A.d199.s1701.p0 | F | T | |
| A.d161.s1355.p0 | F | T | |
| B.d357.s0.p1 | F | T | |
| A.d195.s1663.p2 | F | T | Intriguingly, NR1- |
| A.d151.s1288.p1 | F | T | Immunoprecipitation assays also show a weak substoichiometric association of the |
| B.d485.s0.p4 | F | T | PF4-dependent downregulation of cyclin E-cdk2 activity was associated with increased binding of the |
| B.d814.s0.p26 | F | T | We have shown that the |
| B.d14.s0.p4 | F | T | Actin-binding proteins such as |
| A.d39.s340.p0 | F | indirect | Chloramphenicol acetyltransferase assays in F9 cells showed that |
| B.d307.s0.p4 | F | indirect | In Acanthamoeba |
| B.d35.s4.p9 | F | indirect | We conclude that Aip1p is a |
| L.d35.s1.p1 | F | indirect | Our data demonstrate that the |
| B.d14.s1.p2 | F | indirect | These studies suggest that profilin and |
| I.d11.s28.p1 | F | coreference | The |
| L.d13.s0.p1 | F | indirect | Production of |
| A.d78.s669.p2 | F | indirect | Our data suggest that |
| B.d223.s0.p9 | F | functional | Furthermore, the deletion of |
Pair ID abbreviations: A – AIMed; B – BioInfer; I – IEPA; L – LLL; ground truth (GT): T (true), F (false); type of errors: indirect – no direct interaction between the entities is described; functional – only functional similarity between the entities is described; enumeration – entities are just listed together in an enumeration; coreference – the same protein with different referencing. Entities (in the pair) are highlighted with bold typeface.
The effect on F-score when changing the ground truth of incorrectly annotated pairs with APG and SL kernels
| Kernel | AIMed original | AIMed modified | AIMed retrained | AIMed Δm-o | AIMed Δr-m | BioInfer original | BioInfer modified | BioInfer retrained | BioInfer Δm-o | BioInfer Δr-m |
|---|---|---|---|---|---|---|---|---|---|---|
| APG (setting A) | 56.18 | 56.61 | 56.14 | 0.43 | −0.47 | 60.66 | 60.87 | | 0.21 | 0.32 |
| APG (setting B) | 55.29 | 55.73 | | 0.44 | 0.99 | 60.61 | 60.83 | 60.94 | 0.22 | 0.11 |
| APG (setting C) | 53.20 | 53.66 | 53.96 | 0.46 | 0.30 | 59.91 | 60.36 | 60.88 | 0.45 | 0.52 |
| APG (setting D) | 52.30 | 52.77 | 52.99 | 0.47 | 0.22 | 59.42 | 59.90 | 60.20 | 0.48 | 0.30 |
| APG (avg) | 54.24 | 54.69 | 54.95 | 0.45 | 0.26 | 60.15 | 60.60 | 60.80 | 0.34 | 0.31 |
| SL | 54.48 | 55.06 | | 0.58 | 0.51 | 59.99 | 60.46 | | 0.47 | 0.25 |
Modified – using the original model with modified ground truth; retrained – results of a model retrained on the modified ground truth; Δm-o – difference between modified and original; Δr-m – difference between retrained and modified.
Figure 8. Similarity of kernels as dendrogram and heat map. Colors below the dendrogram indicate the parsing information used by a kernel. Similarity of kernel outputs ranges from full agreement (red) to 33% disagreement (yellow) on the five benchmark corpora. Clustering is performed with R’s hclust (http://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html).
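The similarity measure behind this figure is the fraction of pairs on which two kernels disagree. The sketch below shows one way such a pairwise disagreement matrix could be computed before it is handed to a hierarchical clustering routine such as hclust. The function name and toy predictions are hypothetical, and the clustering step itself is not reproduced here.

```python
from itertools import combinations


def disagreement_matrix(predictions):
    """Fraction of pairs on which each two classifiers give different labels."""
    names = sorted(predictions)
    dist = {(a, a): 0.0 for a in names}
    for a, b in combinations(names, 2):
        differing = sum(x != y for x, y in zip(predictions[a], predictions[b]))
        dist[(a, b)] = dist[(b, a)] = differing / len(predictions[a])
    return dist


# Toy example (hypothetical): 3 classifiers, 4 pairs.
kernel_outputs = {
    "k1": [True, False, False, True],
    "k2": [True, False, False, False],
    "k3": [True, True, True, False],
}
dist = disagreement_matrix(kernel_outputs)
print(dist[("k1", "k2")], dist[("k1", "k3")], dist[("k2", "k3")])
```

A value of 0 corresponds to full agreement; the caption's yellow end of the scale corresponds to a value of about 0.33.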
Surface and parsing features generated from sentence text used for training non-kernel based classifiers
| Feature type | Feature | Example |
|---|---|---|
| surface | distance (word/char) | sentence length in characters |
| | | entity distance in words |
| | count | number of proteins in sentence |
| | negation clues (s/b/w/a) | negation word before entities |
| | hedge clues (s/b/w/a) | hedge word after entities |
| | enumeration clues (b) | comma between entities |
| | interaction word clues (s/b/w/a) | interaction word in sentence |
| | entity modifier (a) | -ing word after first entity |
| parsing | distance (graph) | length of syntax tree shortest path |
| | occurrence features (entire graph) | number of |
| | occurrence features (shortest path) | number of |
| | frequency features (entire graph) | relative frequency of |
| | frequency features (shortest path) | relative frequency of |
| | entropy | Kullback–Leibler divergence of constituent types in the entire syntax tree |
Features may refer to both sentence and pair level characteristics. Parsing features were generated from both syntax and dependency parses. The scope of a feature is typically one of: sentence (s), before entities (b), between entities (w), after entities (a).
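A few of the surface features listed above can be sketched directly from a tokenized sentence. The snippet below is only an illustration: the tokenization, the feature names, and the definition of entity distance as the number of tokens strictly between the two entities are all assumptions, and the protein names are placeholders.

```python
def surface_features(tokens, first_entity, second_entity):
    """Compute some surface features for one entity pair.

    tokens: list of word tokens of the sentence.
    first_entity, second_entity: token indices of the two entities
    (first_entity < second_entity).
    """
    between = tokens[first_entity + 1:second_entity]
    return {
        # Length of the whitespace-joined sentence (assumed tokenization).
        "sentence_length_chars": len(" ".join(tokens)),
        "sentence_length_words": len(tokens),
        # Number of tokens strictly between the two entities.
        "entity_distance_words": second_entity - first_entity - 1,
        # Enumeration clue with scope "between entities".
        "comma_between_entities": "," in between,
    }


# Placeholder sentence with hypothetical protein names.
tokens = ["ProtA", "binds", "to", "ProtB", ",", "ProtC"]
pair_a_b = surface_features(tokens, 0, 3)
pair_b_c = surface_features(tokens, 3, 5)
print(pair_a_b)
print(pair_b_c)
```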
The ten most important features related to difficult (D) and easy (E) classes measured by information gain
| Rank | Feature (D) | Sign (D) | IG (D) | Feature (E) | Sign (E) | IG (E) |
|---|---|---|---|---|---|---|
| 1 | sentence length (char) | − | 0.0089 | label entropy in ST | + | 0.110 |
| 2 | label entropy in ST (SP) | − | 0.0086 | sentence length (char) | + | 0.090 |
| 3 | | − | 0.0079 | label entropy in DG | + | 0.089 |
| 4 | # of proteins in sentence | − | 0.0078 | | − | 0.081 |
| 5 | sentence length (word) | − | 0.0069 | | − | 0.079 |
| 6 | | − | 0.0069 | | − | 0.076 |
| 7 | | − | 0.0066 | | − | 0.073 |
| 8 | | − | 0.0066 | | − | 0.069 |
| 9 | | − | 0.0059 | | − | 0.063 |
| 10 | | − | 0.0057 | | − | 0.062 |
IG – information gain; ST – syntax tree; DG – dependency graph; SP – shortest path. Italic typesetting indicates parsing tree labels. The sign after each feature indicates positive/negative correlation.
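Information gain measures how much knowing a feature value reduces the entropy of the class label. The sketch below is a generic textbook computation for a discrete feature, not the authors' pipeline. It assumes that numeric features such as sentence length have already been discretized, and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(labels) - H(labels | feature) for a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy example (hypothetical): one informative and one uninformative feature.
labels = ["D", "D", "E", "E"]
ig_informative = information_gain(["long", "long", "short", "short"], labels)
ig_uninformative = information_gain(["a", "b", "a", "b"], labels)
print(ig_informative, ig_uninformative)
```

A feature that separates the classes perfectly reaches the full label entropy, while a feature that is independent of the label yields a gain of zero.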
Figure 9. Comparison of some non-kernel based and kernel based classifiers in terms of F-score (CV evaluation). The first 9 are non-kernel based classifiers, the last four are kernel based classifiers.
Results of some simple majority vote ensembles and comparison with best single methods in terms of F-score
| Method | Corpus | P | R | F |
|---|---|---|---|---|
| APG | AIMed | 59.9 | 53.6 | 56.2 |
| APG | BioInfer | 60.2 | 61.3 | 60.7 |
| kBSPS | HPRD50 | 60.0 | 70.2 | |
| APG | IEPA | 66.6 | 82.6 | 73.1 |
| kBSPS | LLL | 69.9 | 95.9 | 79.3 |
| APG+SL+kBSPS | AIMed | 58.0 | ||
| | BioInfer | 60.3 | 66.4 | |
| | HPRD50 | 67.6 | 76.9 | |
| | IEPA | 68.6 | ||
| | LLL | 71.7 | 94.5 | 80.0 |
| APG+SL+BayesNet | AIMed | 55.9 | 60.3 | 57.6 |
| | BioInfer | 58.6 | ||
| | HPRD50 | 69.8 | 67.7 | |
| | IEPA | 79.9 | 74.5 | |
| | LLL | 92.9 | ||
| All 13 kernels | AIMed | 35.8 | 46.6 | |
| | BioInfer | 56.5 | 58.7 | |
| | HPRD50 | 65.4 | 69.3 | 66.1 |
| | IEPA | 70.5 | 78.8 | 73.7 |
| | LLL | 69.6 | 79.5 | |
Best values are typeset in bold.
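A simple majority vote ensemble labels a pair as positive when more than half of its member classifiers do. The sketch below illustrates the idea with three hypothetical classifiers and scores the result against the gold labels. The names and toy data are assumptions, and the scores are computed on the whole toy set rather than averaged over folds.

```python
def majority_vote(member_predictions):
    """Combine boolean predictions: positive if more than half of the members say so."""
    n_members = len(member_predictions)
    return [2 * sum(votes) > n_members for votes in zip(*member_predictions)]


def precision_recall_f(gold, predicted):
    """Precision, recall and F-score of the positive class."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Toy example (hypothetical): gold labels and three member classifiers.
gold = [True, True, False, False]
members = [
    [True, False, False, True],
    [True, False, False, False],
    [True, True, True, False],
]
ensemble = majority_vote(members)
precision, recall, f_score = precision_recall_f(gold, ensemble)
print(ensemble, precision, recall, round(f_score, 3))
```

Using an odd number of members, as in the three-member ensembles of the table, avoids ties in the vote.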