Gerold Schneider, Simon Clematide, Fabio Rinaldi.
Abstract
BACKGROUND: This article describes the approaches taken by the OntoGene group at the University of Zurich in dealing with two tasks of the BioCreative III competition: classification of articles which contain curatable protein-protein interactions (PPI-ACT) and extraction of experimental methods (PPI-IMT).
Year: 2011 PMID: 22151872 PMCID: PMC3269936 DOI: 10.1186/1471-2105-12-S8-S13
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Termness and p(method|word)
| Termness probability | Term word | p(method\|word) | Word | Method |
|---|---|---|---|---|
| 0.831498470948 | anti | 0.490056818 | L1 | MI:0006 |
| 0.692307692307692 | pooling | 0.47027027 | LT | MI:0019 |
| 0.662971175166297 | hybrid | 0.447269303 | ERK1/2 | MI:0006 |
| 0.519792083166733 | x-ray | 0.443877551 | hydrogen-bonding | MI:0114 |
| 0.515198153135822 | coimmunoprecipitation | 0.441441441 | omit | MI:0114 |
| 0.484276729559748 | coip | 0.43876567 | synapses | MI:0006 |
| 0.469194312796209 | bret | 0.436363636 | tumours | MI:0006 |
| 0.396292409933543 | fret | 0.435114504 | REFMAC | MI:0114 |
| 0.369761273209549 | tag | 0.430695698 | p21 | MI:0006 |
| 0.367924528301887 | tomography | 0.424657534 | COOT | MI:0114 |
| 0.35606936416185 | bifc | 0.423558897 | epithelium | MI:0006 |
| 0.35405192761605 | diffraction | 0.418918919 | flower | MI:0018 |
| 0.329399141630901 | resonance | 0.417443409 | IKK | MI:0006 |
| 0.322784810126582 | epr | 0.412797992 | caspase-3 | MI:0006 |
| 0.322607959356478 | crystallography | 0.407843137 | NF-kB | MI:0006 |
| 0.312878528168209 | two-hybrid | 0.406961178 | floral | MI:0018 |
| 0.311203319502075 | 2-hybrid | 0.406926407 | 9E10 | MI:0007 |
| 0.307599517490953 | itc | 0.404040404 | diffracted | MI:0114 |
| 0.307372793354102 | spr | 0.40311174 | atom | MI:0114 |
| 0.303317535545024 | biosensor | 0.403057679 | HIV-1 | MI:0007 |
| 0.300881858902576 | two | 0.40167364 | wwwpdborg | MI:0114 |
| 0.300359712230216 | saxs | 0.401408451 | CCP4 | MI:0114 |
| 0.296829971181556 | bimolecular | 0.39668175 | BK | MI:0006 |
| 0.296758104738155 | plasmon | 0.39629241 | FRET | MI:0055 |
| 0.283073367995378 | bait | 0.394624313 | MCF-7 | MI:0006 |
| 0.282754418037782 | fluorescence | 0.394136808 | contoured | MI:0114 |
| 0.282689623080503 | nmr | 0.39047619 | Å | MI:0114 |
| 0.272583201267829 | isothermal | 0.389684814 | hypoxia | MI:0006 |
| 0.258223684210526 | calorimetry | 0.387915408 | c-Myc | MI:0007 |
| 0.258064516129032 | one-hybrid | 0.387096774 | PI3K | MI:0006 |
| 0.247863247863248 | crosslink | 0.385964912 | specification | MI:0018 |
| 0.238479262672811 | tap | 0.385809313 | seed | MI:0018 |
| 0.222466960352423 | phage | 0.38559322 | 15N | MI:0077 |
| 0.21827744904668 | scattering | 0.384858044 | colorectal | MI:0006 |
| 0.214154411764706 | pull | 0.384114583 | Å2 | MI:0114 |
| 0.211344922232388 | force | 0.38247012 | carboxylate | MI:0114 |
| 0.205298013245033 | bn-page | 0.38225925 | Src | MI:0006 |
| 0.203338930508912 | yeast | 0.381818182 | Argonne | MI:0114 |
| 0.181818181818182 | bioluminescence | 0.38125 | Floral | MI:0018 |
| 0.174830377336031 | kinase | 0.380802518 | Mdm2 | MI:0006 |
| 0.174038675261169 | down | 0.380634391 | carbonyl | MI:0114 |
Results and properties of our official runs.
| Run | Acc | Spec | Sens | F-Score | MCC | AUC iP/R | ME | Feat | DTH |
|---|---|---|---|---|---|---|---|---|---|
| Official run 1 | 88.68 | 97.64 | 38.57 | 50.83 | 0.48297 | 63.85 | + | WMP | 0.50 |
| Official run 2 | 87.93 | 93.06 | 59.23 | 59.82 | 0.52727 | 63.89 | + | WMP | 0.20 |
| Official run 3 | 67.05 | 64.19 | 83.08 | 43.34 | 0.34244 | 41.74 | – | P | |
| Official run 4 | 73.68 | 74.13 | 71.21 | 45.08 | 0.34650 | 41.74 | – | P | |
| Official run 5 | 88.00 | 94.40 | 52.20 | 56.89 | 0.50255 | 62.39 | + | WM | 0.25 |
| Post hoc run 2a | 86.90 | 90.57 | 66.37 | 60.58 | 0.53089 | 64.06 | + | WMP | 0.20 |
| Post hoc run 6 | 87.53 | 91.57 | 64.95 | 61.24 | 0.53969 | 66.30 | + | WMPBS | 0.21 |
Additionally, we report our post hoc run 2a, performed without the minimal feature count threshold of 3 and without the bi-normal feature selection (20,000). Our best post hoc run 6 uses bigrams (B) and syntactic features (S). Features considered are W (bag of words), M (MeSH), P (PPIscore), B (bigrams), S (syntactic); for a detailed description see page 3.
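The evaluation columns in the table above (Acc, Spec, Sens, F-Score, MCC) can all be derived from a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name `binary_metrics` and the counts in the usage line are hypothetical and are not taken from the official runs.

```python
from math import sqrt


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    specificity = tn / (tn + fp) if tn + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # identical to recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom_f = precision + sensitivity
    f_score = 2 * precision * sensitivity / denom_f if denom_f else 0.0
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom_mcc = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom_mcc if denom_mcc else 0.0
    return {
        "acc": accuracy,
        "spec": specificity,
        "sens": sensitivity,
        "f": f_score,
        "mcc": mcc,
    }


# Hypothetical counts for an imbalanced test set (90 positives, 910 negatives).
metrics = binary_metrics(tp=60, fp=40, fn=30, tn=870)
```

On imbalanced data such as this, accuracy stays high even when sensitivity is modest, which is one reason MCC and F-Score are reported alongside it.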
Comparison of mean results of a stratified 10-fold cross-validation experiment on the development set using oversampling versus using discretization threshold (DTH) lowering.
| Method | Acc | Spec | Sens | F-Score | MCC | AUC iP/R | Feat | DTH |
|---|---|---|---|---|---|---|---|---|
| Oversampling | 86.72 | 92.22 | 59.93 | 60.49 | 0.52580 | 66.36 | WMP | 0.50 |
| DTH lowering | 85.94 | 89.76 | 67.35 | 61.92 | 0.53689 | 69.33 | WMP | 0.20 |
| DTH lowering | 87.84 | 93.81 | 58.82 | 62.21 | 0.55363 | 69.33 | WMP | 0.36 |
No restrictions on minimal feature occurrence or feature set size were used in these experiments; therefore, the DTH of 0.20 is no longer optimal. A DTH of 0.36 gives the best results overall. Features considered are W (bag of words), M (MeSH), P (PPIscore), B (bigrams), S (syntactic); for a detailed description see page 3.
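The effect of lowering the DTH can be illustrated with a small sketch. It assumes that the classifier outputs a probability for class 1 and that a document is assigned to class 1 when this probability reaches the DTH; the function names, probabilities and gold labels below are hypothetical.

```python
def discretize(probabilities, dth=0.5):
    """Assign class 1 when the class-1 probability reaches the threshold."""
    return [1 if p >= dth else 0 for p in probabilities]


def sens_spec(gold, predicted):
    """Sensitivity and specificity of binary predictions against gold labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


# Hypothetical gold labels and class-1 probabilities (class 1 is the minority).
gold = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.9, 0.4, 0.25, 0.3, 0.1, 0.05, 0.6, 0.15]

sens_default, spec_default = sens_spec(gold, discretize(probs, dth=0.5))
sens_lowered, spec_lowered = sens_spec(gold, discretize(probs, dth=0.2))
```

In this toy example, lowering the threshold from 0.5 to 0.2 recovers the two positives with low scores at the cost of one additional false positive, the same trade-off between sensitivity and specificity visible in the table.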
Figure 1. Plots of mean MCC on the development set in a stratified 10-fold cross-validation experiment with different feature sets. In the left panel, mean MCC is a function of the percentile of class 1 decisions. In the right panel, mean MCC is a function of varying DTHs.
Comparison of feature set quality using t-test on stratified 10-fold CV subsets from the development set at best performing percentiles (per) of class 1.
| Better feature set @ per | CI | Compared feature set @ per | CI | p | EI |
|---|---|---|---|---|---|
| S @ 19 | 0.415, 0.480 | P @ 35 | 0.397, 0.454 | 0.0379 | 0.022 |
| B @ 13 | 0.460, 0.511 | M @ 13 | 0.370, 0.461 | 0.0082 | 0.070 |
| B @ 13 | 0.460, 0.511 | S @ 19 | 0.415, 0.480 | 0.0273 | 0.038 |
| W @ 20 | 0.491, 0.535 | B @ 13 | 0.460, 0.511 | 0.0116 | 0.028 |
| WS @ 22 | 0.510, 0.568 | W @ 20 | 0.491, 0.535 | 0.0043 | 0.026 |
| WMS @ 20 | 0.546, 0.577 | WS @ 22 | 0.510, 0.568 | 0.0414 | 0.023 |
| WMBS @ 17 | 0.549, 0.586 | WS @ 22 | 0.510, 0.568 | 0.0179 | 0.029 |
| WMPBS @ 17 | 0.558, 0.595 | WS @ 22 | 0.510, 0.568 | 0.0122 | 0.038 |
Features considered are W (bag of words), M (MeSH), P (PPIscore), B (bigrams), S (syntactic); for a detailed description see page 3. Rows are read as follows, taking row 4 as an example: feature set W has a 95% confidence interval (CI) of [0.491, 0.535], and feature set B has one of [0.460, 0.511]. According to a t-test for dependent samples, feature set W is significantly better than feature set B (df=9; p=0.0116). The expected improvement (EI) of the MCC measure is at least 0.028 (95% confidence level). Note that neither feature set PBMSW nor BMSW is significantly better than MSW. For combinations of 2, 3 or 4 different feature sets, only the best performing ones are included in this table.
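A t-test for dependent samples over the 10 cross-validation folds can be sketched as follows. The per-fold MCC values are hypothetical, since the source reports only the summary statistics. Reading the EI as a one-sided 95% lower confidence bound on the mean per-fold difference is an assumption, not something the source states. The critical values are the standard Student t quantiles for df=9.

```python
from math import sqrt
from statistics import mean, stdev

# Student t quantiles for df = 9 (10 folds minus 1).
T_TWO_SIDED_95_DF9 = 2.262  # 0.975 quantile
T_ONE_SIDED_95_DF9 = 1.833  # 0.95 quantile


def paired_t(scores_a, scores_b):
    """Mean difference, its standard error, and the paired t statistic."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = mean(diffs)
    std_err = stdev(diffs) / sqrt(n)  # sample standard deviation (n - 1)
    return n, mean_diff, std_err, mean_diff / std_err


# Hypothetical per-fold MCC values for two feature sets on the same folds.
mcc_a = [0.52, 0.50, 0.53, 0.51, 0.49, 0.54, 0.52, 0.50, 0.51, 0.53]
mcc_b = [0.49, 0.48, 0.49, 0.47, 0.47, 0.50, 0.50, 0.46, 0.48, 0.49]

n, mean_diff, std_err, t_stat = paired_t(mcc_a, mcc_b)
significant = abs(t_stat) > T_TWO_SIDED_95_DF9
# Assumed reading of EI: one-sided 95% lower bound on the mean improvement.
lower_bound = mean_diff - T_ONE_SIDED_95_DF9 * std_err
```

Pairing by fold matters here: both feature sets are evaluated on identical splits, so the test operates on the per-fold differences and is not affected by variation between folds.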
Performance Development of PPI-IMT system A
| IMT | 1. TermDict | 2. TermDict p(m\|term) | 3. Word Corp | 4. run 1 submit | 5. Bigram p(m\|bi) | 6. colloc chi2 | 7. without zoning | 8. comb. run 2 |
|---|---|---|---|---|---|---|---|---|
| Evaluated Res. | 4347 | 4334 | 2355 | 5098 | 11103 | 11094 | 15749 | 21600 |
| TP | 417 | 417 | 369 | 447 | 486 | 486 | 522 | 527 |
| FP | 3930 | 3917 | 1986 | 4651 | 10617 | 10608 | 15227 | 21073 |
| FN | 110 | 110 | 158 | 80 | 41 | 41 | 5 | 0 |
| Micro P | 0.09593 | 0.09622 | 0.15669 | 0.08768 | 0.04377 | 0.04381 | 0.03314 | 0.02440 |
| Micro R | 0.79127 | 0.79127 | 0.70019 | 0.84820 | 0.92220 | 0.92220 | 0.99051 | 1.00000 |
| Micro F | 0.17111 | 0.17157 | 0.25607 | 0.15893 | 0.08358 | 0.08364 | 0.06414 | 0.04763 |
| Micro AUC iP/R | 0.21694 | 0.26532 | 0.21633 | 0.27588 | 0.29466 | 0.29712 | 0.30205 | 0.30034 |
| Macro P | 0.10308 | 0.10333 | 0.16587 | 0.09346 | 0.04532 | 0.04537 | 0.03312 | 0.02440 |
| Macro R | 0.77590 | 0.77590 | 0.69459 | 0.83206 | 0.91261 | 0.91261 | 0.99174 | 1.00000 |
| Macro F | 0.17502 | 0.17542 | 0.25564 | 0.16322 | 0.08517 | 0.08525 | 0.06359 | 0.04735 |
| Macro AUC iP/R | 0.40387 | 0.46438 | 0.39722 | 0.47884 | 0.50159 | 0.50336 | 0.50630 | 0.50890 |
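The micro-averaged rows in the table above follow directly from the TP, FP and FN counts. As a worked check, the sketch below recomputes the values for column 1 (TermDict) from its counts; the function name `micro_prf` is illustrative only.

```python
def micro_prf(tp: int, fp: int, fn: int):
    """Micro-averaged precision, recall and F-score from pooled counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Counts from column 1 (TermDict): TP = 417, FP = 3930, FN = 110.
p, r, f = micro_prf(tp=417, fp=3930, fn=110)
print(f"{p:.5f} {r:.5f} {f:.5f}")  # prints "0.09593 0.79127 0.17111"
```

The macro-averaged rows cannot be recomputed this way, because they average the scores obtained for each document separately and the per-document counts are not given here.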
Performance by experimental method, for all methods where f(train) > 20
| Method | f(train) | f(develop) | f(test) | P (full output) | R (full output) | P (thresholded) | R (thresholded) |
|---|---|---|---|---|---|---|---|
| ALL | 4348 | 1379 | 527 | 4.38% | 92.22% | 15.70% | 80.46% |
| MI:0006 | 736 | 246 | 60 | 27.03% | 100% | 27.03% | 100% |
| MI:0007 | 728 | 212 | 66 | 29.73% | 100% | 29.73% | 100% |
| MI:0096 | 438 | 198 | 98 | 44.14% | 100% | 44.14% | 100% |
| MI:0018 | 403 | 85 | 30 | 13.51% | 100% | 13.51% | 100% |
| MI:0114 | 223 | 50 | 13 | 5.86% | 100% | 6.25% | 100% |
| MI:0416 | 172 | 83 | 61 | 27.48% | 100% | 27.59% | 91.80% |
| MI:0071 | 180 | 35 | 13 | 5.86% | 100% | 6.84% | 100% |
| MI:0424 | 416 | 44 | 15 | 6.76% | 100% | 7.41% | 93.33% |
| MI:0107 | 82 | 19 | 19 | 8.80% | 100% | 23.68% | 94.74% |
| MI:0663 | 68 | 35 | 2 | 0.93% | 100% | 0% | 0% |
| MI:0065 | 61 | 16 | 6 | 2.45% | 83.33% | 0% | 0% |
| MI:0077 | 58 | 11 | 7 | 3.33% | 100% | 20.69% | 85.71% |
| MI:0028 | 51 | 9 | 1 | 0.47% | 100% | 0% | 0% |
| MI:0030 | 46 | 20 | 2 | 0.98% | 100% | 0% | 0% |
| MI:0676 | 45 | 15 | 8 | 3.69% | 100% | 16.67% | 100% |
| MI:0055 | 45 | 9 | 11 | 5.31% | 100% | 32.14% | 81.81% |
| MI:0809 | 41 | 14 | 0 | | | | |
| MI:0415 | 40 | 22 | 2 | 1.04% | 100% | 0% | 0% |
| MI:0004 | 35 | 9 | 6 | 3.03% | 100% | 0% | 0% |
| MI:0029 | 34 | 11 | 6 | 3.16% | 100% | 0% | 0% |
| MI:0040 | 31 | 9 | 0 | | | | |
| MI:0404 | 30 | 12 | 1 | 0.49% | 100% | 0% | 0% |
| MI:0051 | 29 | 9 | 2 | 1.10% | 100% | 0% | 0% |
| MI:0017 | 29 | 7 | 0 | | | | |
| MI:0808 | 28 | 3 | 1 | 0.48% | 100% | 0% | 0% |
| MI:0047 | 28 | 12 | 5 | 2.50% | 100% | 0% | 0% |
| MI:0405 | 27 | 4 | 7 | 3.55% | 85.71% | 0% | 0% |
| MI:0049 | 27 | 13 | 1 | 0.47% | 100% | 0% | 0% |
| MI:0019 | 24 | 3 | 51 | 37.25% | 100% | 0% | 0% |
| MI:0410 | 23 | 0 | 0 | | | | |
| MI:0413 | 21 | 13 | 0 | | | | |
PPI-IMT Performance of the submitted runs
| IMT | run 1 (A) | run 2 (B) | run 3 (A) | run 4 (B) | run 5 (A+B) |
|---|---|---|---|---|---|
| Evaluated Results | 5098 | 21529 | 4576 | 666 | 21600 |
| TP | 447 | 527 | 431 | 223 | 527 |
| FP | 4651 | 21002 | 4145 | 443 | 21073 |
| FN | 80 | 0 | 96 | 304 | 0 |
| Micro P | 0.08768 | 0.02448 | 0.09419 | 0.33483 | 0.02440 |
| Micro R | 0.84820 | 1.00000 | 0.81784 | 0.42315 | 1.00000 |
| Micro F | 0.15893 | 0.04779 | 0.16892 | 0.37385 | 0.04763 |
| Micro AUC iP/R | 0.27588 | 0.24484 | 0.27727 | 0.14169 | 0.29016 |
| Macro P | 0.09346 | 0.02448 | 0.09992 | 0.33483 | 0.02440 |
| Macro R | 0.83206 | 1.00000 | 0.79377 | 0.42883 | 1.00000 |
| Macro F | 0.16322 | 0.04750 | 0.17163 | 0.35403 | 0.04735 |
| Macro AUC iP/R | 0.47884 | 0.44034 | 0.47650 | 0.30927 | 0.50111 |