Bin Liu, Xiaolong Wang, Lei Lin, Buzhou Tang, Qiwen Dong, Xuan Wang.
Abstract
BACKGROUND: Predicting the binding sites between two interacting proteins provides important clues to the function of a protein. Recent research on protein binding site prediction has been based mainly on widely known machine learning techniques, such as artificial neural networks, support vector machines, and conditional random fields. However, the prediction performance is still too low to be used in practice. It is necessary to explore new algorithms, theories and features to further improve the performance.
Year: 2009 PMID: 19925685 PMCID: PMC2785799 DOI: 10.1186/1471-2105-10-381
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Comparison of the classification method with the sequential labelling method for protein binding site prediction. In the predicted labels, I and N represent interface residues and non-interface residues, respectively.
Performance of HM-SVM versus other methods on all data sets
| Data set | Method | Specificity+ (random)^a | Sensitivity+ (random)^b | F1 | Accuracy | MCC | AUC | Time (s)^c |
|---|---|---|---|---|---|---|---|---|
| Hetero-complex I^d | ANN | 37.6% (28.1%) | 59.4% (16.7%) | 46.0% | 60.9% | 18.9% | 64.5% | 326 |
| | SVM | 38.4% (28.1%) | 59.8% (16.8%) | 46.8% | 61.8% | 20.2% | 65.4% | 179461 |
| | CRF | 42.6% (28.1%) | 55.2% (15.5%) | 48.0% | 66.5% | 24.4% | 65.3% | 12151 |
| | HM-SVM | 44.9% (28.1%) | 56.0% (15.7%) | 49.8% | 68.3% | 27.4% | 69.5% | 356 |
| Homo-complex I | ANN | 39.0% (27.0%) | 58.4% (15.8%) | 46.6% | 63.9% | 22.1% | 67.0% | 586 |
| | SVM | 39.6% (27.0%) | 61.9% (16.7%) | 48.3% | 64.2% | 24.2% | 68.6% | 224979 |
| | CRF | 45.1% (27.0%) | 59.2% (16.0%) | 51.2% | 69.5% | 30.2% | 67.6% | 16961 |
| | HM-SVM | 45.4% (27.0%) | 60.0% (16.2%) | 51.7% | 69.7% | 30.9% | 72.2% | 588 |
| Mix I^e | ANN | 40.3% (27.5%) | 51.4% (14.1%) | 44.7% | 65.4% | 20.8% | 65.8% | 1242 |
| | SVM | 39.5% (27.5%) | 61.5% (16.9%) | 48.1% | 63.6% | 23.3% | 67.6% | 831579 |
| | CRF | 44.3% (27.5%) | 57.5% (15.8%) | 49.9% | 68.4% | 28.0% | 66.8% | 28364 |
| | HM-SVM | 45.5% (27.5%) | 58.0% (15.9%) | 51.0% | 69.4% | 29.7% | 71.2% | 891 |
| Hetero-complex II^f | ANN | 45.9% (34.9%) | 60.5% (21.1%) | 52.1% | 61.3% | 21.3% | 65.8% | 604 |
| | SVM | 47.9% (34.9%) | 61.6% (21.5%) | 53.9% | 63.2% | 24.6% | 67.7% | 160625 |
| | CRF | 51.6% (34.9%) | 57.6% (20.1%) | 54.3% | 66.3% | 28.0% | 67.3% | 13441 |
| | HM-SVM | 54.0% (34.9%) | 56.7% (19.8%) | 55.3% | 68.0% | 30.5% | 70.7% | 464 |
| Homo-complex II | ANN | 43.9% (32.3%) | 66.7% (21.5%) | 52.8% | 61.5% | 24.1% | 68.1% | 856 |
| | SVM | 47.1% (32.3%) | 63.1% (20.4%) | 54.0% | 65.2% | 27.7% | 70.2% | 554054 |
| | CRF | 52.5% (32.3%) | 59.7% (19.3%) | 55.9% | 69.5% | 32.9% | 68.7% | 18124 |
| | HM-SVM | 53.3% (32.3%) | 60.1% (19.4%) | 56.5% | 70.1% | 34.0% | 73.4% | 851 |
| Mix II | ANN | 46.5% (33.3%) | 53.4% (17.9%) | 49.4% | 63.7% | 21.7% | 65.8% | 1260 |
| | SVM | 47.5% (33.3%) | 62.3% (20.8%) | 53.9% | 64.5% | 26.5% | 69.2% | 1316103 |
| | CRF | 52.2% (33.3%) | 58.6% (19.5%) | 55.2% | 68.3% | 30.9% | 68.1% | 856765 |
| | HM-SVM | 53.6% (33.3%) | 58.6% (19.6%) | 56.0% | 69.3% | 32.6% | 72.4% | 1320 |
Specificity+ = TP/(TP+FP); Sensitivity+ = TP/(TP+FN); F1 = 2 × Specificity+ × Sensitivity+/(Specificity+ + Sensitivity+); Accuracy = (TP+TN)/(TP+TN+FP+FN); MCC = (TP × TN − FP × FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)); AUC: area under the ROC curve [61]. Here TP is the number of true positives (residues predicted to be interface residues that actually are interface residues); FP is the number of false positives (residues predicted to be interface residues that are in fact not interface residues); TN is the number of true negatives; and FN is the number of false negatives.
^a Values in parentheses are for random prediction. The specificity+ of random prediction is calculated as the total number of interaction-site residues divided by the total number of residues.
^b Values in parentheses are for random prediction. The sensitivity+ of random prediction is calculated as the total number of residues predicted as interaction sites by each method divided by the total number of residues.
^c The total running time (in seconds) for 5-fold cross-validation, including training and testing.
^d Type I data sets treat the minor interface as negative samples.
^e The mixed data set of hetero-complexes and homo-complexes.
^f Type II data sets treat the minor interface as positive samples.
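The metric definitions and the two random baselines in the footnotes above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function name `binding_site_metrics`, the dictionary keys, and the choice to report 0.0 when a denominator is zero are all assumptions made here.

```python
import math


def binding_site_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the table's metrics from residue-level confusion counts."""
    total = tp + fp + tn + fn

    def ratio(num: float, den: float) -> float:
        # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
        return num / den if den else 0.0

    specificity_plus = ratio(tp, tp + fp)  # TP / (TP + FP)
    sensitivity_plus = ratio(tp, tp + fn)  # TP / (TP + FN)
    f1 = ratio(2 * specificity_plus * sensitivity_plus,
               specificity_plus + sensitivity_plus)
    accuracy = ratio(tp + tn, total)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))

    return {
        "specificity_plus": specificity_plus,
        "sensitivity_plus": sensitivity_plus,
        "f1": f1,
        "accuracy": accuracy,
        "mcc": mcc,
        # Random baselines as described in the footnotes:
        # actual interface residues / all residues
        "random_specificity_plus": ratio(tp + fn, total),
        # residues predicted as interface / all residues
        "random_sensitivity_plus": ratio(tp + fp, total),
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for name, value in binding_site_metrics(tp=40, fp=10, tn=30, fn=20).items():
        print(f"{name}: {value:.3f}")
```

The counts in the usage example are invented for illustration and do not correspond to any row of the table.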
Figure 2. ROC curves on the six data sets. The ROC curves of ANN, SVM, CRF and HM-SVM on the six data sets: (a) Hetero-complex I, (b) Homo-complex I, (c) Mix I, (d) Hetero-complex II, (e) Homo-complex II, (f) Mix II.
Summary of computational costs of different methods
| | ANN | SVM | CRF | HM-SVM |
|---|---|---|---|---|
| Train | L | H | H | L |
| Test | L | H | L | L |
L and H represent low computational cost and high computational cost, respectively.
Figure 3. Performance of the different methods trained with different numbers of training samples on the Mix I data set.
Figure 4. Running time of the different methods trained with different numbers of training samples on the Mix I data set. The results were obtained on a personal computer with a 2.2 GHz Intel Pentium CPU and 3 GB of memory.
Performance of HM-SVM trained on modified training sets
| Data set | Specificity+ | Sensitivity+ | F1 | Accuracy | MCC | AUC |
|---|---|---|---|---|---|---|
| Hetero-complex I | 43.8% | 58.3% | 49.5% | 66.7% | 26.5% | 69.1% |
| Homo-complex I | 44.9% | 58.4% | 49.9% | 68.4% | 29.0% | 71.9% |
| Mix I | 42.8% | 62.0% | 49.6% | 65.7% | 27.2% | 70.5% |
| Hetero-complex II | 53.7% | 56.1% | 54.2% | 67.4% | 29.4% | 70.3% |
| Homo-complex II | 53.7% | 55.4% | 53.0% | 69.0% | 31.3% | 72.4% |
| Mix II | 54.5% | 54.9% | 54.5% | 69.6% | 31.8% | 72.0% |
Figure 5. Performance of HM-SVM with different window sizes on the Mix I data set.
Figure 6. Representative prediction results on the Hetero-complex I data set. The target protein (PDB code 1ukv:Y) for which the predictions are made is shown in slate. Predicted interface residues are shown in magenta. The binding partner (PDB code 1ukv:G) is shown in blue. (a) The actual interface residues. (b) ANN. (c) SVM. (d) CRF. (e) HM-SVM.
Figure 7. Representative prediction results on the Homo-complex I data set. The target protein (PDB code 1vz8:B) for which the predictions are made is shown in slate. Predicted interface residues are shown in magenta. The binding partner (PDB code 1vz8:A) is shown in blue. (a) The actual interface residues. (b) ANN. (c) SVM. (d) CRF. (e) HM-SVM.
Summary of six types of data sets
| Data set | Chains | Residues | Surface residues | Interface residues^a |
|---|---|---|---|---|
| Hetero-complex I | 504 | 109829 | 92797 | 26085 (28.1%) |
| Homo-complex I | 620 | 172917 | 141295 | 38170 (27.0%) |
| Mix I | 1124 | 282746 | 234092 | 64255 (27.4%) |
| Hetero-complex II | 504 | 109829 | 92797 | 32386 (34.9%) |
| Homo-complex II | 620 | 172917 | 141295 | 45633 (32.3%) |
| Mix II | 1124 | 282746 | 234092 | 78019 (33.3%) |
^a Values in parentheses are the fraction of surface residues that are interface residues.
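The percentages in the last column can be rechecked from the counts in the table. The sketch below assumes that the column immediately before the interface counts holds the surface residue counts, which is consistent with the footnote's definition; the names `DATA_SETS` and `interface_fraction` are introduced here for illustration only.

```python
# (surface residues, interface residues), copied from the table above.
# Assumption: the column preceding the interface counts is the surface residue count.
DATA_SETS = {
    "Hetero-complex I": (92797, 26085),
    "Homo-complex I": (141295, 38170),
    "Mix I": (234092, 64255),
    "Hetero-complex II": (92797, 32386),
    "Homo-complex II": (141295, 45633),
    "Mix II": (234092, 78019),
}


def interface_fraction(surface: int, interface: int) -> float:
    """Percentage of surface residues that are interface residues."""
    return 100.0 * interface / surface


fractions = {
    name: round(interface_fraction(surface, interface), 1)
    for name, (surface, interface) in DATA_SETS.items()
}

if __name__ == "__main__":
    for name, pct in fractions.items():
        print(f"{name}: {pct}%")
```

Because each Mix set is the union of the corresponding hetero-complex and homo-complex sets, its counts are the sums of the two component rows.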