Kuan-Hsi Chen, Tsai-Feng Wang, Yuh-Jyh Hu.
Abstract
BACKGROUND: Although various machine learning-based predictors have been developed for estimating protein-protein interactions, their performance varies with dataset and species and is affected by two primary factors: the choice of learning algorithm and the representation of protein pairs. To improve the prediction of protein-protein interactions, we exploit the synergy of multiple learning algorithms and the expressiveness of different protein-pair features.
Keywords: Gene ontology; Network topology; Protein-protein interaction; Stacked generalization
Year: 2019 PMID: 31182027 PMCID: PMC6558856 DOI: 10.1186/s12859-019-2907-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Architecture of PPI-MetaGO for predicting protein–protein interactions
Values of the 14 physicochemical property scales of the 20 standard amino acids
| AA | H11ᵃ | H12ᵃ | H2 | NCI | P11ᵃ | P12ᵃ | P2 | SASA | V | F | A1 | E | T | A2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 0.62 | 2.1 | −0.5 | 0.007 | 8.1 | 0 | 0.046 | 1.181 | 27.5 | −1.27 | 0.49 | 15 | −0.8 | 1.064 |
| C | 0.29 | 1.4 | −1.0 | −0.037 | 5.5 | 1.48 | 0.128 | 1.461 | 44.6 | −1.09 | 0.26 | 5 | 0.83 | 1.412 |
| D | −0.9 | 10.0 | 3.0 | −0.024 | 13.0 | 40.7 | 0.105 | 1.587 | 40.0 | 1.42 | 0.78 | 50 | 1.65 | 0.866 |
| E | −0.74 | 7.8 | 3.0 | 0.007 | 12.3 | 49.91 | 0.151 | 1.862 | 62.0 | 1.6 | 0.84 | 55 | −0.92 | 0.851 |
| F | 1.19 | −9.2 | −2.5 | 0.038 | 5.2 | 0.35 | 0.29 | 2.228 | 115.5 | −2.14 | 0.42 | 10 | 0.18 | 1.091 |
| G | 0.48 | 5.7 | 0.0 | 0.179 | 9.0 | 0 | 0 | 0.881 | 0 | 1.86 | 0.48 | 10 | −0.55 | 0.874 |
| H | −0.4 | 2.1 | −0.5 | −0.011 | 10.4 | 3.53 | 0.23 | 2.025 | 79.0 | −0.82 | 0.84 | 56 | 0.11 | 1.105 |
| I | 1.38 | −8.0 | −1.8 | 0.022 | 5.2 | 0.15 | 0.186 | 1.81 | 93.5 | −2.89 | 0.34 | 13 | −1.53 | 1.152 |
| K | −1.5 | 5.7 | 3.0 | 0.018 | 11.3 | 49.5 | 0.219 | 2.258 | 100 | 2.88 | 0.97 | 85 | −1.06 | 0.93 |
| L | 1.06 | −9.2 | −1.8 | 0.052 | 4.9 | 0.45 | 0.186 | 1.931 | 93.5 | −2.29 | 0.4 | 16 | −1.01 | 1.25 |
| M | 0.64 | −4.2 | −1.3 | 0.003 | 5.7 | 1.43 | 0.221 | 2.034 | 94.1 | −1.84 | 0.48 | 20 | −1.48 | 0.826 |
| N | −0.78 | 7.0 | 2.0 | 0.005 | 11.6 | 3.38 | 0.134 | 1.655 | 58.7 | 1.77 | 0.81 | 49 | 3.0 | 0.776 |
| P | 0.12 | 2.1 | 0.0 | 0.240 | 8.0 | 0 | 0.131 | 1.468 | 41.9 | 0.52 | 0.49 | 15 | −0.8 | 1.064 |
| Q | −0.85 | 6.0 | 0.2 | 0.049 | 10.5 | 3.53 | 0.18 | 1.932 | 80.7 | 1.18 | 0.84 | 56 | 0.11 | 1.015 |
| R | −2.53 | 4.2 | 3.0 | 0.044 | 10.5 | 52.0 | 0.291 | 2.56 | 105 | 2.79 | 0.95 | 67 | −1.15 | 0.873 |
| S | −0.18 | 6.5 | 0.3 | 0.005 | 9.2 | 1.67 | 0.062 | 1.298 | 29.3 | 3.0 | 0.65 | 32 | 1.34 | 1.012 |
| T | −0.05 | 5.2 | −0.4 | 0.003 | 8.6 | 1.66 | 0.108 | 1.525 | 51.3 | 1.18 | 0.7 | 32 | 0.27 | 0.909 |
| V | 1.08 | −3.7 | −1.5 | 0.057 | 5.9 | 0.13 | 0.14 | 1.645 | 71.5 | −1.75 | 0.36 | 14 | −0.83 | 1.383 |
| W | 0.81 | −10 | −3.4 | 0.038 | 5.4 | 2.1 | 0.409 | 2.663 | 145.5 | −3.78 | 0.51 | 17 | −0.97 | 0.893 |
| Y | 0.26 | −1.9 | −2.3 | 0.024 | 6.2 | 1.61 | 0.298 | 2.368 | 117.3 | −3.3 | 0.76 | 41 | −0.29 | 1.161 |
H11 & H12 hydrophobicity, H2 hydrophilicity, NCI net charge index of side chains, P11 & P12 polarity, P2 polarizability, SASA solvent-accessible surface area, V volume of side chains, F flexibility, A1 accessibility, E exposed, T turns, A2 antigenic
ᵃ Hydrophobicity (H11 & H12) and polarity (P11 & P12) were calculated by two different methods
Fig. 2 Vectorial representations of two proteins. (a) Each amino acid is first translated into a vector of 14 physicochemical scale values. (b) Both proteins are then represented in a uniform vectorial form with 28 AC values. The calculation of the first two AC values of H11 is demonstrated for one of the proteins with gaps of 1 (g = 1) and 2 (g = 2), respectively
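The AC encoding described in the Fig. 2 caption can be illustrated with a short sketch. It assumes that AC stands for the auto covariance commonly used in sequence-based encodings, AC(g, j) = 1/(n − g) · Σ (x[i][j] − mean_j)(x[i+g][j] − mean_j), where n is the sequence length; the source text does not spell out the formula. The function name, the toy sequence and the absence of any per-scale normalisation are likewise assumptions. Only the H11 and H2 columns of four amino acids are copied from the table above.

```python
from typing import Dict, List

# Two of the 14 scales (H11, H2) for four amino acids, copied from the table above.
SCALES: Dict[str, List[float]] = {
    "A": [0.62, -0.5],
    "C": [0.29, -1.0],
    "D": [-0.90, 3.0],
    "E": [-0.74, 3.0],
}


def auto_covariance(seq: str, max_gap: int) -> List[float]:
    """Return AC values ordered by property, then by gap g = 1..max_gap."""
    rows = [SCALES[aa] for aa in seq]
    n = len(rows)
    if n <= max_gap:
        raise ValueError("sequence must be longer than max_gap")
    features: List[float] = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        mean = sum(column) / n
        for g in range(1, max_gap + 1):
            # Covariance between positions i and i + g for property j.
            total = sum(
                (column[i] - mean) * (column[i + g] - mean) for i in range(n - g)
            )
            features.append(total / (n - g))
    return features


ac = auto_covariance("ACDE", max_gap=2)
print(len(ac))  # 2 properties x 2 gaps = 4 values
```

Under the same assumption, all 14 scales with gaps 1 and 2 would give 14 × 2 = 28 values per protein, which is consistent with the 28 AC values mentioned in the caption.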
in a given set of protein pairs. The LCAs found are stored in a list sorted in ascending order of their hierarchical GO level. Traversing the sorted list in order, we iteratively group each LCA and all its descendants into a cluster, excluding any term already assigned to a previously formed cluster. The entire GO DAG is consequently partitioned into a set of mutually exclusive subgraphs, each rooted at an LCA, as illustrated in Fig. 3.

In the sample hierarchy of Fig. 3, the two protein pairs <P1, P2> and <P5, P6> share a common LCA (G11), which is denoted by LCA3. The LCA of protein pair <P3, P4> (G4) is denoted by LCA2. The LCAs of protein pairs <P7, P8> and <P9, P10> (G15 and G1) are denoted by LCA4 and LCA1, respectively. These four LCAs are organized into a sorted list L in ascending order of their hierarchical levels, namely, L = (LCA4, LCA3, LCA2, LCA1). The first LCA in the sorted list, LCA4, is grouped with all its descendants in the hierarchy. The resulting cluster contains G15, G20, G21, G26, G27, G28, G33, G34, G35, G36, G42, G43, G44, and G45. Similarly, by grouping G11 (i.e. LCA3) with all its descendants, we obtain the second cluster of GO terms, a hierarchical subgraph rooted at G11 that contains 11 GO terms, including G11 itself. Continuing to the next LCA in the list, LCA2, we cluster G4 with all of its descendants that have not been assigned to an earlier cluster. Excluding the terms in the second cluster, we form the third cluster of GO terms, consisting of G4, G7, G8, G12, G18, G24, G25, G32, G40 and G41. Finally, based on LCA1, we group G1, G2, G3, G5, G6, G9, G10, G13, G14 and G19 into the fourth cluster. The entire hierarchy is consequently partitioned into four subgraphs, one per LCA, based on the provided training set of protein pairs. Given different training protein pairs, the hierarchy is partitioned differently, reflecting the interaction characteristics of those pairs.
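The clustering step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy hierarchy, the `DEPTH` table and the function names are hypothetical, and the LCAs are processed deepest first, which is an assumption about how the sorted list is ordered (it matches the example, where G15 is handled before G1).

```python
from typing import Dict, List, Set

# Hypothetical toy hierarchy (term -> children); it is NOT the DAG of Fig. 3.
CHILDREN: Dict[str, List[str]] = {
    "G1": ["G2", "G3"],
    "G2": ["G4"],
    "G3": ["G4", "G5"],
    "G4": ["G6"],
    "G5": [],
    "G6": [],
}
# Assumed depth of each term (root = 0), used only to order the LCAs.
DEPTH: Dict[str, int] = {"G1": 0, "G2": 1, "G3": 1, "G4": 2, "G5": 2, "G6": 3}


def descendants(root: str) -> Set[str]:
    """Collect the root and every term reachable below it."""
    seen: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(CHILDREN[node])
    return seen


def partition_by_lcas(lcas: List[str]) -> Dict[str, Set[str]]:
    """Give each term to the first LCA cluster that reaches it."""
    clusters: Dict[str, Set[str]] = {}
    assigned: Set[str] = set()
    # Deepest LCA first; ties are broken by name for determinism.
    for lca in sorted(set(lcas), key=lambda t: (-DEPTH[t], t)):
        cluster = descendants(lca) - assigned
        clusters[lca] = cluster
        assigned |= cluster
    return clusters


clusters = partition_by_lcas(["G1", "G4", "G3"])
for lca, members in clusters.items():
    print(lca, sorted(members))
```

Because each cluster removes the terms already assigned, the resulting clusters are mutually exclusive even though G4 is reachable from both G2 and G3 in the toy DAG.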
Fig. 3 Demonstration of GO DAG partitioning into clusters based on LCAs
into numeric values of LCA-indexed GO-based features, we first locate the GO terms of the two proteins on each LCA-indexed subgraph. For each GO term, we trace the ascending path up to the root of its subgraph and count the nodes visited; the number of distinct nodes visited on a subgraph is assigned as the value of the corresponding GO-term feature. Figure 4 shows the encoding of two protein pairs into two feature vectors, based on the four LCA-indexed GO-term features presented in Fig. 3. To obtain the LCA-indexed GO-term feature vector for the first protein pair, we locate the GO terms of its two proteins on the hierarchy. The GO terms G5 and G6 are located in the subgraph of LCA1, terms G7 and G8 in the subgraph of LCA2, and G20 in the subgraph of LCA4. The subgraph rooted at LCA3 contains no GO term of either protein. Tracing along the ascending paths from G5 and G6 up to LCA1 (blue arrows on the subgraph of LCA1 in Fig. 4), we encounter G5, G6, G3, and G1 (a total of four nodes). Therefore, the value of the LCA1-indexed GO-term feature is 4. Similarly, the values of the GO-term features indexed by LCA2 and LCA4 are 3 and 2, respectively. As the subgraph of LCA3 contains no GO terms of either protein, the GO-term feature indexed by LCA3 is assigned a value of zero. Finally, with the features ordered as in L = (LCA4, LCA3, LCA2, LCA1), the LCA-indexed GO-term feature vector for this pair is obtained as (2, 0, 3, 4). The GO terms of the second protein pair are converted into the GO-term feature vector (0, 3, 3, 0) by the same process (see Fig. 4). Because the partitioning of the GO DAG depends on the given training data, the GO-based features of the same protein pair can vary in number and value as the training data change. This flexibility yields GO-based features that are better tailored to the data and leads to higher predictive performance as the size and quality of the training data increase.
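The encoding of the first protein pair can be reproduced with a small sketch. The parent links below are hypothetical: they are chosen only so that the node counts agree with the worked example, since the actual edges of Fig. 4 are not available in the text. The feature order (LCA4, LCA3, LCA2, LCA1) is inferred from the vector (2, 0, 3, 4) given above, and all names are illustrative.

```python
from typing import Dict, List, Set

# Hypothetical parent links, chosen to match the node counts of the worked example.
PARENTS: Dict[str, List[str]] = {
    "G1": [], "G3": ["G1"], "G5": ["G3"], "G6": ["G3"],
    "G4": ["G1"], "G7": ["G4"], "G8": ["G4"],
    "G11": ["G4"],
    "G15": ["G1"], "G20": ["G15"],
}
# Cluster membership (term -> LCA label), restricted to the terms used here.
CLUSTER_OF: Dict[str, str] = {
    "G1": "LCA1", "G3": "LCA1", "G5": "LCA1", "G6": "LCA1",
    "G4": "LCA2", "G7": "LCA2", "G8": "LCA2",
    "G11": "LCA3",
    "G15": "LCA4", "G20": "LCA4",
}
FEATURE_ORDER = ["LCA4", "LCA3", "LCA2", "LCA1"]  # inferred from the sorted list L


def encode_pair(go_terms: Set[str]) -> List[int]:
    """Count the distinct nodes on the ascending paths inside each LCA subgraph."""
    visited: Dict[str, Set[str]] = {lca: set() for lca in FEATURE_ORDER}
    for term in go_terms:
        home = CLUSTER_OF[term]
        stack = [term]
        while stack:
            node = stack.pop()
            # Stop when the path leaves the subgraph or revisits a node.
            if CLUSTER_OF[node] != home or node in visited[home]:
                continue
            visited[home].add(node)
            stack.extend(PARENTS[node])
    return [len(visited[lca]) for lca in FEATURE_ORDER]


vector = encode_pair({"G5", "G6", "G7", "G8", "G20"})
print(vector)  # [2, 0, 3, 4]
```

Counting distinct nodes, rather than summing the path lengths, is what makes the LCA1 feature equal to 4: the paths from G5 and G6 share G3 and G1.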
Fig. 4 Example of encoding protein pairs into LCA-indexed GO-term feature vectors. The blue and green arrows show the ascending traversals up to the LCAs from the GO terms of the first and second protein pairs, respectively
: (a) the number of common neighbors, (b) the Jaccard index, (c) the Adamic–Adar index, (d) the preferential attachment score, and (e) the Otsuka–Ochiai coefficient [43, 44]. The network-based features are formally defined in Table 2. Like the GO-based features, the network-based features of the same protein pair adapt to the training data: when the training data change, so does the topology of the PPI network, and with it the feature values.
Summary of network-based features
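| Features | Definitionᵃ |
|---|---|
| Common neighbors | $\lvert N(P_i) \cap N(P_j) \rvert$ |
| Jaccard index | $\dfrac{\lvert N(P_i) \cap N(P_j) \rvert}{\lvert N(P_i) \cup N(P_j) \rvert}$ |
| Adamic–Adar index | $\sum_{P_k \in N(P_i) \cap N(P_j)} \dfrac{1}{\log \lvert N(P_k) \rvert}$ |
| Preferential attachment score | $\lvert N(P_i) \rvert \cdot \lvert N(P_j) \rvert$ |
| Otsuka–Ochiai coefficient | $\dfrac{\lvert N(P_i) \cap N(P_j) \rvert}{\sqrt{\lvert N(P_i) \rvert \, \lvert N(P_j) \rvert}}$ |

ᵃ N(P) denotes the set of P's neighbors; P_i and P_j are the two proteins of a pair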
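The five network-based features can be computed from neighbor sets alone. The sketch below uses the usual definitions of these indices, which are assumed to match those intended in the table; the toy graph, the function name and the choice of the natural logarithm in the Adamic–Adar index are illustrative assumptions.

```python
import math
from typing import Dict, Set

# Hypothetical undirected toy network, stored as neighbor sets N(P).
GRAPH: Dict[str, Set[str]] = {
    "A": {"C", "D"},
    "B": {"C", "D", "E"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
    "E": {"B"},
}


def network_features(graph: Dict[str, Set[str]], p: str, q: str) -> Dict[str, float]:
    """Topological similarity scores for the protein pair (p, q)."""
    n_p, n_q = graph[p], graph[q]
    common = n_p & n_q
    union = n_p | n_q
    degree_product = len(n_p) * len(n_q)
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # Skip neighbors of degree 1, for which log(1) = 0.
        "adamic_adar": sum(
            1.0 / math.log(len(graph[z])) for z in common if len(graph[z]) > 1
        ),
        "preferential_attachment": degree_product,
        "otsuka_ochiai": (
            len(common) / math.sqrt(degree_product) if degree_product else 0.0
        ),
    }


feats = network_features(GRAPH, "A", "B")
for name, value in feats.items():
    print(f"{name}: {value:.4f}")
```

Since every score depends only on the current neighbor sets, rebuilding the network from a different training set changes the feature values for the same pair.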
Summary of benchmark datasets
| Label | Species | Proteins | Interactions (positive/negative) | Prediction Tool |
|---|---|---|---|---|
| HS1 | Homo sapiens | 9439 | 37,027/37,027 | PRED_PPI (Guo et al.) |
| EC1 | Escherichia coli | 1834 | 6954/6954 | PRED_PPI (Guo et al.) |
| DM1 | Drosophila melanogaster | 7059 | 21,975/21,975 | PRED_PPI (Guo et al.) |
| CE | Caenorhabditis elegans | 2640 | 4030/4030 | PRED_PPI (Guo et al.) |
| SC1 | Saccharomyces cerevisiae | 2245 | 3956/3956 | PRED_PPI (Guo et al.) |
| HS2ᵃ | Homo sapiens | 7033 | 24,718/177,117 | SPRINT (Li & Ilie) |
| HS3 | Homo sapiens | 1515 | 12,244/12,244 | TRI_tool (Perovic et al.) |
| SC2 | Saccharomyces cerevisiae | 3291 | 15,238/15,238 | go2ppi-RF (Maetschke et al.) |
| HS4 | Homo sapiens | 3296 | 3490/3490 | go2ppi-RF (Maetschke et al.) |
| EC2 | Escherichia coli | 589 | 1167/1167 | go2ppi-RF (Maetschke et al.) |
| SP | Schizosaccharomyces pombe | 904 | 742/742 | go2ppi-RF (Maetschke et al.) |
| AT | Arabidopsis thaliana | 756 | 541/541 | go2ppi-RF (Maetschke et al.) |
| MM | Mus musculus | 1088 | 500/500 | go2ppi-RF (Maetschke et al.) |
| DM2 | Drosophila melanogaster | 658 | 321/321 | go2ppi-RF (Maetschke et al.) |
| SC3 | Saccharomyces cerevisiae | 2152 | 3844/3844 | go2ppi-RF (Maetschke et al.) |
| HS5 | Homo sapiens | 6037 | 1091/3427 | HVSM (Zhang et al.) |
| SC4 | Saccharomyces cerevisiae | 5436 | 4529/10,831 | HVSM (Zhang et al.) |
| SC5 | Saccharomyces cerevisiae | 454 | 500/500 | GIS-MaxEnt (Armean et al.) |
| SC6 | Saccharomyces cerevisiae | 4424 | 17,257/48,594 | DeepSequencePPI (Gonzalez-Lopez et al.) |
ᵃ The authors of SPRINT [6] prepared three separate human PPI data sets (C1, C2 and C3). To facilitate 10-fold CV in our experiments, we merged all three into a single set of human PPI data and removed the redundancies
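Merging several pair lists and removing redundancies can be sketched as below. This is only an illustration of the idea: it assumes that a protein pair is unordered, so that (P1, P2) and (P2, P1) count as the same interaction, and the protein identifiers and list contents are made up.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Pair = Tuple[str, str]


def merge_pair_sets(*datasets: Iterable[Pair]) -> Set[FrozenSet[str]]:
    """Union of several pair lists, treating each pair as unordered."""
    merged: Set[FrozenSet[str]] = set()
    for dataset in datasets:
        for a, b in dataset:
            merged.add(frozenset((a, b)))
    return merged


# Made-up pair lists standing in for three separate data sets.
c1 = [("P1", "P2"), ("P3", "P4")]
c2 = [("P2", "P1"), ("P5", "P6")]
c3 = [("P3", "P4")]

merged = merge_pair_sets(c1, c2, c3)
print(len(merged))  # 3
```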
Summary of different PPI datasets for Homo sapiens, Saccharomyces cerevisiae, Escherichia coli, and Drosophila melanogaster. (a) the numbers of coincident proteins and (b) the numbers of coincident interacting and non-interacting protein pairs (Pos and Neg, respectively) in the datasets
(a) Coincident proteins

| Protein | HS2 | HS3 | HS4 | HS5 |
|---|---|---|---|---|
| HS1 | 4513 (11959ᵃ) | 971 (9983) | 2460 (10275) | 2699 (12777) |
| HS2 | – | 1043 (7505) | 2272 (8057) | 2492 (10578) |
| HS3 | – | – | 620 (4191) | 616 (6936) |
| HS4 | – | – | – | 1472 (7861) |

| Protein | SC2 | SC3 | SC4 | SC5 | SC6 |
|---|---|---|---|---|---|
| SC1 | 1759 (3777) | 2078 (2319) | 2088 (5593) | 0 (2699) | 1979 (4690) |
| SC2 | – | 1762 (3681) | 3187 (5540) | 0 (3745) | 2622 (5093) |
| SC3 | – | – | 2074 (5514) | 0 (2606) | 2001 (4574) |
| SC4 | – | – | – | 0 (5890) | 3612 (6248) |
| SC5 | – | – | – | – | 0 (4878) |

| Protein | EC2 |
|---|---|
| EC1 | 469 (1954) |

| Protein | DM2 |
|---|---|
| DM1 | 295 (7422) |

(b) Coincident protein pairs (Pos above the diagonal, Neg below the diagonal)

| Neg ╲ Pos | HS1 | HS2 | HS3 | HS4 | HS5 |
|---|---|---|---|---|---|
| HS1 | – | 8388 (53357) | 2282 (46989) | 1626 (38891) | 514 (37604) |
| HS2 | 87 (214057ᵇ) | – | 2742 (34220) | 1505 (26703) | 451 (25363) |
| HS3 | 5 (49266) | 59 (189302) | – | 463 (15271) | 194 (13141) |
| HS4 | 4 (40513) | 15 (180592) | 2 (15732) | – | 272 (4309) |
| HS5 | 0 (40454) | 5 (180539) | 1 (15670) | 0 (6917) | – |

| Neg ╲ Pos | SC1 | SC2 | SC3 | SC4 | SC5 | SC6 |
|---|---|---|---|---|---|---|
| SC1 | – | 1985 (17236) | 3587 (4213) | 3372 (5113) | 0 (4456) | 3526 (17687) |
| SC2 | 4 (19190) | – | 2073 (17009) | 2534 (17233) | 0 (15738) | 4479 (28016) |
| SC3 | 10 (7790) | 8 (19074) | – | 3532 (4841) | 0 (4344) | 3728 (17373) |
| SC4 | 4 (14783) | 12 (26057) | 3 (14672) | – | 0 (5029) | 3602 (18184) |
| SC5 | 0 (4456) | 0 (15738) | 0 (4344) | 0 (11331) | – | 0 (17757) |
| SC6 | 43 (52507) | 76 (63756) | 28 (52410) | 42 (59383) | 0 (49094) | – |

| Neg ╲ Pos | EC1 | EC2 |
|---|---|---|
| EC1 | – | 384 (7737) |
| EC2 | 3 (8118) | – |

| Neg ╲ Pos | DM1 | DM2 |
|---|---|---|
| DM1 | – | 15 (22281) |
| DM2 | 0 (22296) | – |
HS Homo sapiens, SC Saccharomyces cerevisiae, EC Escherichia coli, DM Drosophila melanogaster
ᵃ Numbers in parentheses are the total numbers of non-duplicated proteins in the two datasets, e.g. HS1 and HS2
ᵇ Numbers in parentheses are the total numbers of non-duplicated protein pairs in the two datasets, e.g. HS1 and HS2
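The parenthesised totals follow from inclusion-exclusion: the size of the union of two datasets is the sum of their sizes minus the number of shared items. A quick check against the HS1 and HS2 entries, using the protein and interaction counts from the benchmark summary (the function name is illustrative):

```python
def union_size(size_a: int, size_b: int, shared: int) -> int:
    """|A ∪ B| = |A| + |B| - |A ∩ B|."""
    return size_a + size_b - shared


# Proteins: HS1 has 9439, HS2 has 7033, and 4513 are shared.
print(union_size(9439, 7033, 4513))    # 11959
# Positive pairs: HS1 has 37027, HS2 has 24718, and 8388 are shared.
print(union_size(37027, 24718, 8388))  # 53357
# Negative pairs: HS1 has 37027, HS2 has 177117, and 87 are shared.
print(union_size(37027, 177117, 87))   # 214057
```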
Performance results of 10-fold CV of PPI prediction methods
| | PPI-MetaGO | | | | | | | Other recent prediction tools | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dataset | TPR | FPR | Prec | ACC | F-score | MCC | AUC | TPR | FPR | Prec | ACC | F-score | MCC | AUC | Tool |
| HS1 | 0.964 | 0.013 | 0.987 | 0.975 | 0.975 | 0.951 | 0.993 | 0.835 | 0.046 | 0.948 | 0.895 | 0.888 | 0.795 | 0.900 | PRED_PPI |
| EC1 | 0.923 | 0.015 | 0.984 | 0.954 | 0.952 | 0.909 | 0.983 | 0.897 | 0.147 | 0.860 | 0.875 | 0.878 | 0.752 | 0.935 | PRED_PPI |
| DM1 | 0.966 | 0.010 | 0.990 | 0.978 | 0.978 | 0.956 | 0.996 | 0.750 | 0.223 | 0.771 | 0.763 | 0.760 | 0.527 | 0.841 | PRED_PPI |
| CE | 0.984 | 0.004 | 0.995 | 0.990 | 0.990 | 0.979 | 0.997 | 0.833 | 0.158 | 0.841 | 0.838 | 0.837 | 0.676 | 0.910 | PRED_PPI |
| SC1 | 0.898 | 0.051 | 0.947 | 0.923 | 0.921 | 0.848 | 0.974 | 0.686 | 0.342 | 0.667 | 0.672 | 0.676 | 0.344 | 0.737 | PRED_PPI |
| HS2 | 0.327 | 0.009 | 0.834 | 0.91 | 0.469 | 0.487 | 0.791 | 0.540 | 0.072 | 0.513 | 0.881 | 0.526 | 0.458 | 0.814 | SPRINT |
| HS3 | 0.826 | 0.187 | 0.816 | 0.820 | 0.821 | 0.639 | 0.897 | 0.789 | 0.193 | 0.803 | 0.798 | 0.796 | 0.596 | 0.878 | TRI_tool |
| SC2 | 0.858 | 0.059 | 0.936 | 0.899 | 0.895 | 0.802 | 0.952 | 0.819 | 0.076 | 0.915 | 0.872 | 0.864 | 0.747 | 0.921 | go2ppi-RF |
| HS4 | 0.826 | 0.106 | 0.887 | 0.860 | 0.855 | 0.723 | 0.921 | 0.786 | 0.126 | 0.863 | 0.830 | 0.822 | 0.663 | 0.890 | go2ppi-RF |
| EC2 | 0.879 | 0.075 | 0.922 | 0.902* | 0.900* | 0.805* | 0.950* | 0.869 | 0.059 | 0.937 | 0.905 | 0.902 | 0.813 | 0.951 | go2ppi-RF |
| SP | 0.922 | 0.065 | 0.935 | 0.929 | 0.928 | 0.858 | 0.965 | 0.865 | 0.096 | 0.901 | 0.885 | 0.882 | 0.771 | 0.941 | go2ppi-RF |
| AT | 0.778 | 0.163 | 0.830 | 0.808* | 0.801 | 0.619* | 0.866 | 0.684 | 0.105 | 0.875 | 0.789 | 0.764 | 0.596 | 0.810 | go2ppi-RF |
| MM | 0.754 | 0.182 | 0.808 | 0.786 | 0.779 | 0.575 | 0.860 | 0.604 | 0.128 | 0.836 | 0.738 | 0.695 | 0.500 | 0.762 | go2ppi-RF |
| DM2 | 0.857 | 0.118 | 0.885 | 0.869 | 0.867 | 0.744 | 0.916 | 0.832 | 0.146 | 0.853 | 0.843 | 0.841 | 0.688 | 0.889 | go2ppi-RF |
| SC3 | 0.786 | 0.104 | 0.883 | 0.841 | 0.831 | 0.686 | 0.894 | 0.707 | 0.120 | 0.858 | 0.794 | 0.774 | 0.598 | 0.826 | go2ppi-RF |
| HS5 | 0.824 | 0.026 | 0.911 | 0.938 | 0.864 | 0.826 | 0.974 | 0.782 | 0.213 | 0.801 | 0.784 | 0.609 | 0.578 | 0.849 | HVSM |
| SC4 | 0.773 | 0.036 | 0.901 | 0.908 | 0.832 | 0.773 | 0.945 | 0.707 | 0.213 | 0.777 | 0.747 | 0.581 | 0.505 | 0.797 | HVSM |
| SC5 | 0.920 | 0.034 | 0.965 | 0.943 | 0.942 | 0.888 | 0.984 | 0.926 | 0.088 | 0.915 | 0.919 | 0.920 | 0.839 | 0.977 | GIS-MaxEnt |
| SC6 | 0.912 | 0.064 | 0.934 | 0.924 | 0.923 | 0.848 | 0.972 | 0.920 | 0.078 | 0.942 | 0.932 | 0.931 | 0.864 | 0.978 | DeepSequencePPI |
TPR true positive rate, FPR false positive rate, Prec precision, ACC accuracy, MCC Matthews correlation coefficient, AUC area under ROC
HS Homo sapiens, EC Escherichia coli, DM Drosophila melanogaster, CE Caenorhabditis elegans, SC Saccharomyces cerevisiae, SP Schizosaccharomyces pombe, AT Arabidopsis thaliana, MM Mus musculus
* denotes a statistically insignificant difference in a paired t-test between PPI-MetaGO and the prediction tool in the 10-fold CV at the significance level α = 0.05
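All of the reported measures except AUC can be derived from a confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical, chosen for a balanced test set of 1000 positive and 1000 negative pairs, and are not taken from the experiments.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Confusion-matrix metrics; AUC is omitted because it needs ranked scores."""
    tpr = tp / (tp + fn)                    # true positive rate (recall)
    fpr = fp / (fp + tn)                    # false positive rate
    prec = tp / (tp + fp)                   # precision
    acc = (tp + tn) / (tp + fp + tn + fn)   # accuracy
    f_score = 2 * prec * tpr / (prec + tpr)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "TPR": tpr, "FPR": fpr, "Prec": prec,
        "ACC": acc, "F-score": f_score, "MCC": mcc,
    }


# Hypothetical counts: 1000 positives (964 found), 1000 negatives (13 misclassified).
m = binary_metrics(tp=964, fp=13, tn=987, fn=36)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts, TPR is 0.964 and FPR is 0.013, and the remaining values fall close to those in the HS1 row for PPI-MetaGO; the actual per-fold counts behind the table are not given in the text.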
AUCs of cross-species predictions of PPI-MetaGO/go2ppi-NB using the biological process (BP), cellular component (CC), and molecular function (MF) ontology, respectively
| Train ╲ Test | EC2 | SP | HS4 | SC2 | DM2 | AT | MM |
|---|---|---|---|---|---|---|---|
| BP | | | | | | | |
| EC2 | 0.94/0.88 | 0.92/0.78 | 0.86/0.76 | 0.87/0.77 | 0.69/0.80 | 0.73/0.64 | 0.59/0.65 |
| SP | 0.87/0.65 | 0.96/0.81 | 0.88/0.74 | 0.87/0.75 | 0.68/0.74 | 0.78/0.55 | 0.60/0.61 |
| HS4 | 0.90/0.72 | 0.94/0.75 | 0.95/0.76 | 0.88/0.73 | 0.71/0.80 | 0.76/0.64 | 0.63/0.68 |
| SC2 | 0.89/0.80 | 0.95/0.79 | 0.90/0.76 | 0.95/0.79 | 0.76/0.83 | 0.73/0.67 | 0.58/0.70 |
| DM2 | 0.83/0.60 | 0.92/0.70 | 0.85/0.71 | 0.87/0.67 | 0.79/0.78 | 0.68/0.63 | 0.58/0.60 |
| AT | 0.82/0.72 | 0.91/0.80 | 0.84/0.74 | 0.86/0.75 | 0.73/0.76 | 0.86/0.72 | 0.61/0.63 |
| MM | 0.81/0.62 | 0.87/0.71 | 0.85/0.73 | 0.85/0.71 | 0.70/0.74 | 0.69/0.56 | 0.73/0.69 |
| CC | | | | | | | |
| EC2 | 0.94/0.88 | 0.91/0.67 | 0.86/0.68 | 0.87/0.68 | 0.68/0.74 | 0.71/0.59 | 0.55/0.66 |
| SP | 0.85/0.55 | 0.96/0.82 | 0.88/0.70 | 0.88/0.78 | 0.66/0.73 | 0.73/0.61 | 0.56/0.56 |
| HS4 | 0.89/0.70 | 0.94/0.70 | 0.95/0.80 | 0.88/0.77 | 0.71/0.79 | 0.75/0.65 | 0.66/0.68 |
| SC2 | 0.89/0.78 | 0.95/0.74 | 0.90/0.76 | 0.94/0.83 | 0.75/0.80 | 0.72/0.65 | 0.58/0.64 |
| DM2 | 0.82/0.64 | 0.91/0.69 | 0.84/0.74 | 0.87/0.79 | 0.81/0.80 | 0.70/0.63 | 0.58/0.60 |
| AT | 0.79/0.57 | 0.90/0.66 | 0.84/0.67 | 0.85/0.73 | 0.67/0.70 | 0.85/0.71 | 0.61/0.61 |
| MM | 0.76/0.70 | 0.87/0.71 | 0.86/0.74 | 0.85/0.77 | 0.68/0.77 | 0.61/0.61 | 0.70/0.70 |
| MF | | | | | | | |
| EC2 | 0.94/0.88 | 0.92/0.65 | 0.87/0.62 | 0.87/0.66 | 0.69/0.70 | 0.74/0.62 | 0.57/0.56 |
| SP | 0.85/0.81 | 0.97/0.76 | 0.87/0.65 | 0.87/0.67 | 0.68/0.72 | 0.76/0.67 | 0.58/0.57 |
| HS4 | 0.89/0.85 | 0.94/0.78 | 0.95/0.76 | 0.88/0.68 | 0.72/0.76 | 0.75/0.67 | 0.63/0.68 |
| SC2 | 0.88/0.89 | 0.95/0.73 | 0.89/0.66 | 0.95/0.76 | 0.75/0.75 | 0.74/0.59 | 0.55/0.62 |
| DM2 | 0.79/0.80 | 0.92/0.68 | 0.85/0.65 | 0.86/0.66 | 0.82/0.79 | 0.72/0.67 | 0.57/0.60 |
| AT | 0.75/0.72 | 0.93/0.70 | 0.83/0.63 | 0.83/0.60 | 0.72/0.70 | 0.86/0.75 | 0.61/0.58 |
| MM | 0.79/0.77 | 0.88/0.66 | 0.86/0.67 | 0.85/0.65 | 0.67/0.74 | 0.71/0.70 | 0.72/0.67 |
EC Escherichia coli, SP Schizosaccharomyces pombe, HS Homo sapiens, SC Saccharomyces cerevisiae, DM Drosophila melanogaster, AT Arabidopsis thaliana, MM Mus musculus
Performance results in 10-fold CV of PPI-MetaGO with different feature combinations
| Features | TPR | FPR | Prec | ACC | F-score | MCC | AUC |
|---|---|---|---|---|---|---|---|
| HS1 | |||||||
| F1 | 0.917 | 0.016 | 0.983 | 0.951 | 0.949 | 0.903 | 0.981 |
| F2 | 0.872 | 0.101 | 0.897 | 0.886 | 0.884 | 0.772 | 0.901 |
| F3 | 0.686 | 0.637 | 0.521 | 0.525 | 0.588 | 0.053 | 0.534 |
| F1&F2 | 0.894 | 0.031 | 0.966 | 0.931 | 0.929 | 0.865 | 0.977 |
| F1&F3 | 0.926 | 0.028 | 0.971 | 0.949 | 0.948 | 0.899 | 0.987 |
| F2&F3 | 0.885 | 0.089 | 0.909 | 0.898 | 0.896 | 0.796 | 0.915 |
| F1&F2&F3 | 0.964 | 0.013 | 0.987 | 0.975 | 0.975 | 0.951 | 0.993 |
| DM1 | |||||||
| F1 | 0.978 | 0.001 | 0.999 | 0.988 | 0.988 | 0.977 | 0.997 |
| F2 | 0.664 | 0.237 | 0.727 | 0.714 | 0.644 | 0.449 | 0.787 |
| F3 | 0.690 | 0.604 | 0.417 | 0.543 | 0.497 | 0.132 | 0.538 |
| F1&F2 | 0.933 | 0.005 | 0.995 | 0.964 | 0.963 | 0.930 | 0.995 |
| F1&F3 | 0.977 | 0.001 | 0.999 | 0.988 | 0.988 | 0.976 | 0.998 |
| F2&F3 | 0.740 | 0.253 | 0.728 | 0.743 | 0.705 | 0.501 | 0.765 |
| F1&F2&F3 | 0.966 | 0.010 | 0.990 | 0.978 | 0.978 | 0.956 | 0.996 |
| HS3 | |||||||
| F1 | 0.812 | 0.215 | 0.790 | 0.798 | 0.801 | 0.597 | 0.862 |
| F2 | 0.730 | 0.244 | 0.750 | 0.743 | 0.740 | 0.487 | 0.788 |
| F3 | 0.626 | 0.235 | 0.731 | 0.695 | 0.672 | 0.397 | 0.733 |
| F1&F2 | 0.809 | 0.206 | 0.797 | 0.801 | 0.803 | 0.602 | 0.870 |
| F1&F3 | 0.812 | 0.191 | 0.809 | 0.811 | 0.811 | 0.621 | 0.883 |
| F2&F3 | 0.720 | 0.202 | 0.781 | 0.759 | 0.749 | 0.520 | 0.810 |
| F1&F2&F3 | 0.826 | 0.187 | 0.816 | 0.820 | 0.821 | 0.639 | 0.897 |
| SC2 | |||||||
| F1 | 0.747 | 0.261 | 0.741 | 0.743 | 0.744 | 0.486 | 0.812 |
| F2 | 0.805 | 0.133 | 0.858 | 0.836 | 0.831 | 0.673 | 0.871 |
| F3 | 0.796 | 0.063 | 0.927 | 0.866 | 0.856 | 0.740 | 0.885 |
| F1&F2 | 0.809 | 0.145 | 0.848 | 0.832 | 0.828 | 0.665 | 0.908 |
| F1&F3 | 0.841 | 0.070 | 0.923 | 0.885 | 0.880 | 0.774 | 0.933 |
| F2&F3 | 0.858 | 0.065 | 0.930 | 0.897 | 0.893 | 0.796 | 0.934 |
| F1&F2&F3 | 0.858 | 0.059 | 0.936 | 0.899 | 0.895 | 0.802 | 0.952 |
| EC2 | |||||||
| F1 | 0.763 | 0.169 | 0.821 | 0.797 | 0.790 | 0.598 | 0.845 |
| F2 | 0.810 | 0.089 | 0.902 | 0.860 | 0.853 | 0.725 | 0.913 |
| F3 | 0.878 | 0.075 | 0.922 | 0.902 | 0.899 | 0.805 | 0.938 |
| F1&F2 | 0.793 | 0.141 | 0.850 | 0.826 | 0.820 | 0.655 | 0.895 |
| F1&F3 | 0.901 | 0.069 | 0.929 | 0.916 | 0.914 | 0.832 | 0.956 |
| F2&F3 | 0.915 | 0.046 | 0.952 | 0.934 | 0.933 | 0.870 | 0.973 |
| F1&F2&F3 | 0.913 | 0.048 | 0.950 | 0.933 | 0.931 | 0.866 | 0.960 |
F1 physicochemical features, F2 LCA-indexed GO-term features, F3 network-based features
TPR true positive rate, FPR false positive rate, Prec precision, ACC accuracy, MCC Matthews correlation coefficient, AUC area under ROC
EC Escherichia coli, HS Homo sapiens, SC Saccharomyces cerevisiae, DM Drosophila melanogaster
Performance results of 10-fold CV, using mixed positive and negative data from different datasets
| | PPI-MetaGO | | | | | | | Other recent prediction tools | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dataset | TPR | FPR | Prec | ACC | F-score | MCC | AUC | TPR | FPR | Prec | ACC | F-score | MCC | AUC | Tool |
| HS5(+)ᵃ HS4(−) | 0.811 | 0.031 | 0.893 | 0.932 | 0.850 | 0.808 | 0.971 | 0.730 | 0.054 | 0.808 | 0.894 | 0.766 | 0.700 | 0.932 | HVSM |
| SC6(+)ᵇ SC2(−) | 0.810 | 0.155 | 0.858 | 0.826 | 0.832 | 0.656 | 0.901 | 0.819 | 0.204 | 0.824 | 0.811 | 0.822 | 0.621 | 0.891 | DeepSequencePPI |
TPR true positive rate, FPR false positive rate, Prec precision, ACC accuracy, MCC Matthews correlation coefficient, AUC area under ROC
HS Homo sapiens, SC Saccharomyces cerevisiae
ᵃ Combination of positive data from HS5 and negative data from HS4
ᵇ Combination of positive data from SC6 and negative data from SC2
Numbers of F2 features generated in each run of 10-fold CV on HS1
| Run | Number of F2 (ontology CC) | Number of F2 (ontology BP) | Number of F2 (ontology MF) | Total F2 |
|---|---|---|---|---|
| 1 | 83 | 327 | 329 | 739 |
| 2 | 84 | 324 | 328 | 736 |
| 3 | 86 | 327 | 331 | 744 |
| 4 | 83 | 326 | 335 | 744 |
| 5 | 84 | 325 | 334 | 743 |
| 6 | 85 | 328 | 333 | 746 |
| 7 | 82 | 323 | 328 | 733 |
| 8 | 82 | 327 | 331 | 740 |
| 9 | 86 | 318 | 330 | 734 |
| 10 | 82 | 326 | 328 | 736 |
F2 LCA-indexed GO-term features
Average numbers of F2 features generated for HS1 to HS5 of H. sapiens
| Dataset | Avg Number of F2 (ontology CC) | Avg Number of F2 (ontology BP) | Avg Number of F2 (ontology MF) | Avg Total F2 |
|---|---|---|---|---|
| HS1 | 84 | 325 | 331 | 740 |
| HS2 | 103 | 448 | 469 | 1020 |
| HS3 | 28 | 29 | 54 | 111 |
| HS4 | 31 | 78 | 118 | 227 |
| HS5 | 33 | 105 | 134 | 272 |
F2 LCA-indexed GO-term features, HS Homo sapiens