Calem J Bendell, Shalon Liu, Tristan Aumentado-Armstrong, Bogdan Istrate, Paul T Cernek, Samuel Khan, Sergiu Picioreanu, Michael Zhao, Robert A Murgita.
Abstract
BACKGROUND: Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking generally requires high-resolution structures of both binding partners, and sequence analysis requires that a significant number of recurrent patterns exist for the identification of a potential binding site. Researchers have turned to machine learning to overcome some of the other methods' restrictions by generalising interface sites with sets of descriptive features. Best practices for dataset generation, features, and learning algorithms have not yet been identified or agreed upon, and an analysis of the overall efficacy of machine learning based PPI predictors is due, in order to highlight potential areas for improvement.
Year: 2014 PMID: 24661439 PMCID: PMC4021185 DOI: 10.1186/1471-2105-15-82
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The true negative rate (TNr) and true positive rate (TPr) of various classifiers, including Logistic Regression (LR), Bayesian Network (BN), Multilayer Perceptron (MP), and Functional Tree (FT), as a function of the proportion of instances labelled as interacting, evaluated by LOOCV on the total test set. The true negative rates are universally descending and the true positive rates universally ascending.
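The trend described in the caption follows from how such a cut-off works: if the top-scoring fraction of instances is labelled as interacting, raising that fraction can only add predicted positives, so TPr cannot decrease and TNr cannot increase. The following is a minimal Python sketch of that idea, not the authors' evaluation code; the function name `rates_at_proportion`, the scores, and the labels are all illustrative assumptions.

```python
def rates_at_proportion(scores, labels, proportion):
    """Label the top `proportion` of instances (by score) as interacting
    and return (TPr, TNr) measured against the true 0/1 labels."""
    n_pos_pred = round(proportion * len(scores))
    # Indices sorted from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    predicted = set(order[:n_pos_pred])
    tp = sum(1 for i in predicted if labels[i] == 1)
    fp = len(predicted) - tp
    pos = sum(labels)
    neg = len(labels) - pos
    tn = neg - fp
    return tp / pos, tn / neg


# Hypothetical classifier scores and true labels (1 = interface residue).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

# Sweep the labelled proportion from 0% to 100% in steps of 10%.
curve = [rates_at_proportion(scores, labels, p / 10) for p in range(11)]
for p, (tpr, tnr) in zip(range(11), curve):
    print(f"{p / 10:.1f}  TPr={tpr:.2f}  TNr={tnr:.2f}")
```

Whatever the classifier, the two curves move in opposite directions along this sweep, so the choice of proportion is a trade-off between sensitivity and specificity.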
Table 1. The features selected by the iterative genetic search algorithm with evaluation by Logistic Regression (LR), Bayesian Network (BN), Functional Tree (FT), REP Tree (RT), and Alternating Decision Tree (AT)
| Feature | Count | | | | | |
|---|---|---|---|---|---|---|
| relSESA | 10 | □∙ | □∙ | □∙ | □∙ | □∙ |
| esolv | 6 | □ | □∙ | ∙ | | □∙ |
| Density | 6 | ∙ | | ∙ | □∙ | □∙ |
| ePot | 5 | | □∙ | □ | □ | ∙ |
| Scorecons | 5 | □∙ | ∙ | □ | □ | |
| rate4site | 5 | □ | ∙ | ∙ | □ | |
| Disorder | 5 | □∙ | | ∙ | □ | □ |
| B-Factor | 4 | | ∙ | □∙ | | □ |
| Roughness | 4 | □∙ | □∙ | | | |
| Hydro | 3 | ∙ | | □ | □ | |
| Protrusion | 3 | □∙ | | | □ | |
| Propensity | 3 | □ | | □ | ∙ | |
| Curvature | 2 | □∙ | | | | |
A white box (□) indicates that the feature was selected when the algorithm was tested on all proteins using leave-one-out cross-validation, while a black circle (∙) indicates that it was selected when tested on the NI1 subset. Count is the number of times a feature was selected across both datasets.
Table 2. F1, MCC, TPr, TNr, and PRC using the full set of features (All) or only the features selected in Table 1 (Selected)
| Algorithm | Measure | Full Set: All | Full Set: Selected | Sig. | NI1: All | NI1: Selected | Sig. |
|---|---|---|---|---|---|---|---|
| LR | MCC | 0.1247 | 0.1304 | * | 0.1684 | 0.188 | *** |
| LR | TPr | 0.6462 | 0.6497 | | 0.6449 | 0.6597 | * |
| LR | TNr | 0.5386 | 0.5435 | ** | 0.5435 | 0.5519 | |
| LR | PRC | 0.2082 | 0.2108 | * | 0.3445 | 0.3543 | *** |
| LR | F1 | 0.2946 | 0.2971 | | 0.4359 | 0.447 | ** |
| AT | MCC | 0.1376 | 0.1418 | | 0.1859 | 0.2012 | *** |
| AT | TPr | 0.7064 | 0.7615 | *** | 0.7958 | 0.8047 | |
| AT | TNr | 0.4915 | 0.4428 | | 0.4049 | 0.4149 | *** |
| AT | PRC | 0.2092 | 0.2053 | | 0.3328 | 0.3383 | *** |
| AT | F1 | 0.3019 | 0.304 | | 0.4562 | 0.4635 | *** |
| BN | MCC | 0.1356 | 0.1412 | | 0.188 | 0.2043 | ** |
| BN | TPr | 0.7794 | 0.8553 | *** | 0.7866 | 0.842 | *** |
| BN | TNr | 0.4143 | 0.3341 | | 0.4197 | 0.3749 | |
| BN | PRC | 0.2008 | 0.1976 | | 0.3358 | 0.3341 | |
| BN | F1 | 0.3005 | 0.3007 | | 0.4574 | 0.4654 | ** |
| FT | MCC | 0.1319 | 0.1408 | ** | 0.1607 | 0.1981 | *** |
| FT | TPr | 0.704 | 0.7545 | *** | 0.6931 | 0.821 | *** |
| FT | TNr | 0.4881 | 0.4504 | | 0.485 | 0.3905 | |
| FT | PRC | 0.2059 | 0.2055 | | 0.335 | 0.3343 | |
| FT | F1 | 0.2983 | 0.3034 | *** | 0.4385 | 0.4628 | *** |
| RT | MCC | 0.1059 | 0.1196 | *** | 0.128 | 0.1769 | *** |
| RT | TPr | 0.6333 | 0.6583 | *** | 0.6246 | 0.6986 | *** |
| RT | TNr | 0.5227 | 0.5173 | | 0.5221 | 0.502 | |
| RT | PRC | 0.1999 | 0.2041 | ** | 0.327 | 0.343 | *** |
| RT | F1 | 0.2843 | 0.2922 | *** | 0.417 | 0.4461 | *** |
Algorithm abbreviations are given in Table 1. The algorithms were applied to the full protein set (Full Set) and the NI1 subset. Statistical significance was calculated using a 1-sided Wilcoxon signed-rank test (* P < 0.05; ** P < 0.01; *** P < 0.001).
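For reference, the five measures in the table can be computed from confusion-matrix counts as sketched below. This assumes the standard textbook definitions and reads PRC as precision, which is an interpretation rather than something the record states. The function name `interface_metrics` and the counts are hypothetical, and the sketch assumes no denominator is zero.

```python
import math


def interface_metrics(tp, fp, tn, fn):
    """Return TPr, TNr, PRC (read here as precision), F1 and MCC
    from confusion-matrix counts. Assumes no denominator is zero."""
    tpr = tp / (tp + fn)              # true positive rate (recall)
    tnr = tn / (tn + fp)              # true negative rate (specificity)
    prc = tp / (tp + fp)              # precision
    f1 = 2 * prc * tpr / (prc + tpr)  # harmonic mean of precision and recall
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom  # Matthews correlation coefficient
    return {"TPr": tpr, "TNr": tnr, "PRC": prc, "F1": f1, "MCC": mcc}


# Hypothetical counts for one protein with 20 interface
# and 80 non-interface residues.
m = interface_metrics(tp=13, fp=37, tn=43, fn=7)
print({name: round(value, 3) for name, value in m.items()})
```

Because MCC uses all four counts, it is less affected than F1 by the heavy imbalance between interface and non-interface residues, which is one reason both are usually reported together.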
Table 3. Change in MCC and F1 of predictions on the full set of 392 proteins and the better labelled NI1 subset of 71 proteins for various machine learning algorithms (shown in Table 2) with or without feature selection
| Algorithm | All features: ΔMCC | All features: % ΔMCC | Sig. | All features: ΔF1 | All features: % ΔF1 | Sig. | Selected features: ΔMCC | Selected features: % ΔMCC | Sig. | Selected features: ΔF1 | Selected features: % ΔF1 | Sig. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LR | 0.044 | 35.106 | *** | 0.141 | 47.960 | *** | 0.058 | 44.183 | *** | 0.150 | 50.43 | *** |
| BN | 0.052 | 38.682 | *** | 0.157 | 52.233 | *** | 0.063 | 44.726 | *** | 0.165 | 54.78 | *** |
| FT | 0.029 | 21.798 | * | 0.140 | 47.035 | *** | 0.057 | 40.724 | *** | 0.160 | 52.48 | *** |
| RT | 0.022 | 20.943 | * | 0.133 | 46.655 | *** | 0.057 | 47.932 | *** | 0.154 | 52.67 | *** |
| AT | 0.048 | 35.068 | *** | 0.154 | 51.107 | *** | 0.059 | 41.926 | *** | 0.158 | 52.05 | *** |
| avg | 0.039 | 30.319 | | 0.145 | 48.998 | | 0.059 | 43.898 | | 0.157 | 52.48 | |
Algorithm abbreviations are given in Table 1. Statistical significance was calculated using a 1-sided Wilcoxon signed-rank test (* P < 0.05; ** P < 0.01; *** P < 0.001).
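The entries appear to be the difference between the NI1 and full-set scores, given both in absolute terms and as a percentage of the full-set score. That reading is an interpretation, not something the record states. As a check, the sketch below recomputes the first LR entries (0.044, 35.106; 0.141, 47.960) from the first pair of full-set and NI1 values in the MCC and F1 rows of Table 2 (0.1247 and 0.1684; 0.2946 and 0.4359). The helper name `change` is hypothetical.

```python
def change(full_set, ni1):
    """Absolute and percentage change from the full-set score to the NI1 score."""
    delta = ni1 - full_set
    return delta, 100 * delta / full_set


# Rounded scores copied from Table 2 (first block, first value of each pair).
d_mcc, pct_mcc = change(0.1247, 0.1684)
d_f1, pct_f1 = change(0.2946, 0.4359)

print(f"MCC: {d_mcc:.3f} ({pct_mcc:.2f}%)")
print(f"F1:  {d_f1:.3f} ({pct_f1:.2f}%)")
```

With these inputs the absolute changes round to the tabulated 0.044 and 0.141, and the percentages agree with the tabulated values to within about 0.1 percentage points. The remaining gap is expected, since the inputs are themselves rounded.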
Figure 2. The average MCC calculated by leave-one-out cross-validation as the lowest scoring proteins were iteratively removed from the full dataset.
Figure 3. The average MCC for the 10 best scoring proteins calculated by leave-one-out cross-validation as the lowest scoring proteins were iteratively removed from the full dataset.
Figure 4. The average MCC calculated by leave-one-out cross-validation as the lowest scoring proteins were iteratively removed from the NI1 dataset.
Table 4. Performance measures for RAD-T on the Docking Benchmark of 129 proteins, resulting in 188 unbound complexes, compared to other servers, including Cons-PPISP (Cons-P), PINUP, Promate, and Meta-PPISP (Meta-P), data for which were previously generated [54]
| Measure | RAD-T | Cons-P | PINUP | Promate | Meta-P |
|---|---|---|---|---|---|
| TPr | 0.647 | 0.306 | 0.267 | 0.347 | 0.303 |
| PRC | 0.285 | 0.465 | 0.49 | 0.407 | 0.365 |
| F1 | 0.355 | 0.369 | 0.346 | 0.375 | 0.331 |
| MCC | 0.222 | 0.267 | 0.262 | 0.246 | 0.195 |
Table 5. Comparative data for RAD-T and the servers Cons-PPISP (Cons-P), PINUP, Promate, PIER, and Meta-PPISP (Meta-P)
| Measure | RAD-T | Cons-P | PINUP | Promate | PIER | Meta-P | Average | Median |
|---|---|---|---|---|---|---|---|---|
| MCC | 0.264 | 0.147 | 0.166 | 0.151 | 0.136 | 0.230 | 0.166 | 0.151 |
| TPr | 0.809 | 0.322 | 0.255 | 0.285 | 0.939 | 0.836 | 0.527 | 0.322 |
| TNr | 0.458 | 0.810 | 0.879 | 0.836 | 0.152 | 0.377 | 0.611 | 0.810 |
| PRC | 0.447 | 0.493 | 0.547 | 0.529 | 0.400 | 0.441 | 0.482 | 0.493 |
| F1 | 0.576 | 0.390 | 0.348 | 0.370 | 0.561 | 0.577 | 0.449 | 0.390 |
| ΔMCC RAD-T | 0.000 | 0.117 | 0.098 | 0.113 | 0.129 | 0.034 | 0.098 | 0.113 |
| % ΔMCC RAD-T | 0.00 | 79.19 | 58.73 | 75.26 | 94.78 | 14.88 | 59.11 | 75.26 |
| ΔF1 RAD-T | 0.000 | 0.186 | 0.228 | 0.205 | 0.015 | -0.002 | 0.126 | 0.186 |
| % ΔF1 RAD-T | 0.00 | 47.73 | 65.51 | 55.53 | 2.66 | -0.33 | 28.16 | 47.73 |
These servers were chosen for consistency with the literature and because they were the prediction servers that demonstrated reliability or were not altogether offline. The averages and medians shown do not include RAD-T.
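As an illustration of that last point, the two trailing values of a row can be reproduced from the five server values alone. The sketch assumes that the first value in each row is RAD-T, that the next five belong to the comparison servers, and that the last two are the average and median. It is a check on the MCC and TPr rows, not the authors' own calculation.

```python
from statistics import mean, median

# Server values copied from the MCC and TPr rows (RAD-T excluded).
server_mcc = [0.147, 0.166, 0.151, 0.136, 0.230]
server_tpr = [0.322, 0.255, 0.285, 0.939, 0.836]

for name, values in (("MCC", server_mcc), ("TPr", server_tpr)):
    print(f"{name}: average={mean(values):.3f} median={median(values):.3f}")
```

Under that assumption the results match the tabulated 0.166 and 0.151 for MCC and 0.527 and 0.322 for TPr. RAD-T's MCC of 0.264 lies above both summary values, consistent with RAD-T being excluded from them.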