Jun-Feng Xia, Xing-Ming Zhao, Jiangning Song, De-Shuang Huang.
Abstract
BACKGROUND: It is well known that most of the binding free energy of a protein interaction is contributed by a few key hot spot residues. These residues are crucial for understanding the function of proteins and studying their interactions. Experimental hot spot detection methods such as alanine scanning mutagenesis are not applicable on a large scale because they are time-consuming and expensive. Reliable and efficient computational methods for identifying hot spots are therefore urgently needed.
Year: 2010 PMID: 20377884 PMCID: PMC2874803 DOI: 10.1186/1471-2105-11-174
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Training set of protein structures
| PDB | First molecule | Second molecule |
|---|---|---|
| | Angiogenin | Ribonuclease Inhibitor |
| | Human growth hormone | Human growth hormone binding protein |
| | Immunoglobulin Fab 5G9 | Tissue factor |
| | Barnase | Barstar |
| | Colicin E9 Immunity Im9 | Colicin E9 DNase |
| | BPTI Trypsin inhibitor | Chymotrypsin |
| | Blood coagulation factor VIIA | Tissue factor |
| | Idiotopic antibody FV D1.3 | Anti-idiotopic antibody FV E5.2 |
| | Fc fragment | Fragment B of protein A |
| | Fc (IGG1) | Protein G |
| | Envelope protein GP120 | CD4 |
| | Antibody A6 | Interferon-gamma receptor |
| | Mouse monoclonal antibody D1.3 | Hen egg lysozyme |
| | BPTI | Trypsin |
| | Hen Egg Lysozyme | Ig FAB fragment HyHEL-10 |
Figure 1. Feature importance. This figure presents the importance of 62 features and their contribution to the discriminative quality (in descending order) as measured by F-score. The meanings of the feature symbols are described in Additional file 2.
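The caption does not define the F-score used for ranking. As a minimal sketch, assuming it is the Fisher-style feature-selection score often paired with SVMs (between-class separation of the means divided by the sum of within-class sample variances), a per-feature score could be computed as follows. The function name `f_score` and all toy values are hypothetical and are not taken from the paper.

```python
from statistics import mean, variance


def f_score(pos, neg):
    """Fisher-style F-score of one feature (assumed definition, see lead-in).

    pos: feature values for hot spot residues
    neg: feature values for non-hot spot residues
    """
    overall = mean(list(pos) + list(neg))
    # Squared distance of each class mean from the overall mean.
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    # Sample variances use an (n - 1) denominator.
    within = variance(pos) + variance(neg)
    return between / within


# Hypothetical toy data: one well-separated feature and one noisy feature.
features = {
    "separating": ([0.9, 0.8, 0.7], [0.2, 0.3, 0.1]),
    "noisy": ([0.5, 0.1, 0.9], [0.4, 0.6, 0.2]),
}
scores = {name: f_score(pos, neg) for name, (pos, neg) in features.items()}

# Rank features in descending order of F-score, as in the figure.
ranking = sorted(scores, key=scores.get, reverse=True)
```

A larger score means the two classes are better separated along that feature, so the well-separated toy feature ranks first.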
Prediction performance of individual-feature based SVM models
| Feature | Dataset | Specificity | Recall | Precision | Accuracy | F1 | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|---|---|
| RcsASA | Training set | 0.79 | 0.74 | 0.71 | 0.77 | 0.72 | 46 | 73 | 19 | 16 |
| RcsASA | Test set | 0.66 | 0.67 | 0.46 | 0.66 | 0.55 | 26 | 58 | 30 | 13 |
| RctASA | Training set | 0.78 | 0.71 | 0.69 | 0.75 | 0.70 | 44 | 72 | 20 | 18 |
| RctASA | Test set | 0.68 | 0.72 | 0.50 | 0.69 | 0.59 | 28 | 60 | 28 | 11 |
| RcpASA | Training set | 0.78 | 0.79 | 0.71 | 0.79 | 0.75 | 49 | 72 | 20 | 13 |
| RcpASA | Test set | 0.70 | 0.59 | 0.47 | 0.67 | 0.52 | 23 | 62 | 26 | 16 |
| BsRASA | Training set | 0.72 | 0.79 | 0.65 | 0.75 | 0.72 | 49 | 66 | 26 | 13 |
| BsRASA | Test set | 0.52 | 0.72 | 0.40 | 0.58 | 0.51 | 28 | 46 | 42 | 11 |
| RcsmPI | Training set | 0.75 | 0.81 | 0.68 | 0.77 | 0.74 | 50 | 69 | 23 | 12 |
| RcsmPI | Test set | 0.74 | 0.69 | 0.54 | 0.72 | 0.61 | 27 | 65 | 23 | 12 |
| BtRASA | Training set | 0.72 | 0.69 | 0.62 | 0.71 | 0.66 | 43 | 66 | 26 | 19 |
| BtRASA | Test set | 0.56 | 0.72 | 0.42 | 0.61 | 0.53 | 28 | 49 | 39 | 11 |
| BpRASA | Training set | 0.62 | 0.82 | 0.59 | 0.70 | 0.69 | 51 | 57 | 35 | 11 |
| BpRASA | Test set | 0.53 | 0.67 | 0.39 | 0.57 | 0.49 | 26 | 47 | 41 | 13 |
| RctmPI | Training set | 0.76 | 0.73 | 0.67 | 0.75 | 0.70 | 45 | 70 | 22 | 17 |
| RctmPI | Test set | 0.78 | 0.67 | 0.58 | 0.75 | 0.62 | 26 | 69 | 19 | 13 |
| BsASA | Training set | 0.61 | 0.81 | 0.58 | 0.69 | 0.68 | 50 | 56 | 36 | 12 |
| BsASA | Test set | 0.61 | 0.59 | 0.40 | 0.61 | 0.48 | 23 | 54 | 34 | 16 |
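The rate columns in this table follow from the four confusion-matrix counts. The sketch below uses the standard definitions of specificity, recall, precision, accuracy and F1 (the function name `confusion_metrics` is illustrative) and recomputes the RcsASA training-set row from its TP, TN, FP and FN values.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    specificity = tn / (tn + fp)          # true negative rate
    recall = tp / (tp + fn)               # true positive rate (sensitivity)
    precision = tp / (tp + fp)            # positive predictive value
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "specificity": specificity,
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Counts from the RcsASA training-set row: TP=46, TN=73, FP=19, FN=16.
rcsasa_train = confusion_metrics(tp=46, tn=73, fp=19, fn=16)
rounded = {name: round(value, 2) for name, value in rcsasa_train.items()}
```

Rounded to two decimals, the result reproduces the tabulated 0.79, 0.74, 0.71, 0.77 and 0.72.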
Correlation coefficients among the nine top-ranking features
| Feature | RcsASA | RctASA | RcpASA | BsRASA | RcsmPI | BtRASA | BpRASA | RctmPI | BsASA |
|---|---|---|---|---|---|---|---|---|---|
| RcsASA | 1.0000 | 0.9714 | 0.7168 | -0.8382 | 0.8582 | -0.8454 | -0.6826 | 0.8007 | -0.7866 |
| RctASA | | 1.0000 | 0.7733 | -0.8234 | 0.8609 | -0.8632 | -0.7299 | 0.8357 | -0.7931 |
| RcpASA | | | 1.0000 | -0.6201 | 0.6752 | -0.6364 | -0.6932 | 0.6770 | -0.5972 |
| BsRASA | | | | 1.0000 | -0.7052 | 0.9555 | 0.7933 | -0.6724 | 0.9239 |
| RcsmPI | | | | | 1.0000 | -0.7477 | -0.6730 | 0.9536 | -0.6601 |
| BtRASA | | | | | | 1.0000 | 0.8563 | -0.7084 | 0.8966 |
| BpRASA | | | | | | | 1.0000 | -0.6640 | 0.7465 |
| RctmPI | | | | | | | | 1.0000 | -0.6399 |
| BsASA | | | | | | | | | 1.0000 |
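The table does not say which correlation coefficient is reported. Assuming it is the Pearson coefficient, each entry could be computed from two feature vectors as sketched below. The function name `pearson` and the toy vectors are hypothetical. Values near ±1, such as the 0.9714 between RcsASA and RctASA, indicate that two features carry largely redundant information.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance numerator and the two standard-deviation terms share the
    # same 1/n factor, so it cancels and is omitted.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical feature vectors over four residues.
feature_a = [1.0, 2.0, 3.0, 4.0]
feature_b = [2.0, 4.0, 6.0, 8.0]   # rises with feature_a
feature_c = [8.0, 6.0, 4.0, 2.0]   # falls as feature_a rises

r_ab = pearson(feature_a, feature_b)
r_ac = pearson(feature_a, feature_c)
```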
Evaluation of the hot spot prediction using different machine learning classifiers based on the RcsASA feature
| Classifier | Dataset | Specificity | Recall | Precision | Accuracy | F1 | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|---|---|
| SVM | Training set | 0.79 | 0.74 | 0.71 | 0.77 | 0.72 | 46 | 73 | 19 | 16 |
| SVM | Test set | 0.66 | 0.67 | 0.46 | 0.66 | 0.55 | 26 | 58 | 30 | 13 |
| Bayes Net | Training set | 0.79 | 0.56 | 0.65 | 0.70 | 0.60 | 35 | 73 | 19 | 27 |
| Bayes Net | Test set | 0.85 | 0.28 | 0.46 | 0.68 | 0.35 | 11 | 75 | 13 | 28 |
| Naïve Bayes | Training set | 0.75 | 0.81 | 0.68 | 0.77 | 0.74 | 50 | 69 | 23 | 12 |
| Naïve Bayes | Test set | 0.58 | 0.72 | 0.43 | 0.62 | 0.54 | 28 | 51 | 37 | 11 |
| RBF Network | Training set | 0.85 | 0.63 | 0.74 | 0.76 | 0.67 | 39 | 78 | 14 | 23 |
| RBF Network | Test set | 0.76 | 0.62 | 0.53 | 0.72 | 0.57 | 24 | 67 | 21 | 15 |
| Decision Tree (J48) | Training set | 0.87 | 0.53 | 0.73 | 0.73 | 0.62 | 33 | 80 | 12 | 29 |
| Decision Tree (J48) | Test set | 0.84 | 0.28 | 0.44 | 0.67 | 0.34 | 11 | 74 | 14 | 28 |
| Decision Table | Training set | 0.79 | 0.56 | 0.65 | 0.70 | 0.60 | 35 | 73 | 19 | 27 |
| Decision Table | Test set | 0.85 | 0.28 | 0.46 | 0.68 | 0.35 | 11 | 75 | 13 | 28 |
Evaluation of hot spot prediction using the majority voting method based on the independent test set
| Classifier number | Specificity | Recall | Precision | Accuracy | F1 | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|---|
| 9 (all) | 0.67 | 0.69 | 0.48 | 0.68 | 0.57 | 27 | 59 | 29 | 12 |
| 7 (F1 > 0.50) | 0.68 | 0.69 | 0.49 | 0.69 | 0.58 | 27 | 60 | 28 | 12 |
| 3 (F1 > 0.59) | 0.76 | 0.72 | 0.57 | 0.75 | 0.64 | 28 | 67 | 21 | 11 |
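Majority voting combines the binary predictions of several classifiers by labelling a residue a hot spot when more than half of them say so. The following is a minimal sketch of that rule, not the authors' implementation. The function names and the toy predictions are hypothetical, and an odd number of classifiers is assumed so that ties cannot occur (as with the 9, 7 and 3 classifiers in the table).

```python
def majority_vote(votes):
    """Return 1 (hot spot) if more than half of the 0/1 votes are 1, else 0."""
    return 1 if sum(votes) > len(votes) / 2 else 0


def ensemble_predict(per_classifier_predictions):
    """Combine predictions given as one list of 0/1 labels per classifier."""
    # zip(*...) transposes the input so that each tuple holds every
    # classifier's vote for a single residue.
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]


# Hypothetical predictions of three classifiers for four residues.
predictions = [
    [1, 0, 1, 0],  # classifier 1
    [1, 1, 0, 0],  # classifier 2
    [0, 1, 1, 0],  # classifier 3
]
combined = ensemble_predict(predictions)
```

In this toy case every one of the first three residues receives two of three votes, so the ensemble labels them hot spots even though no single classifier predicted all three.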
Performance comparison with different methods based on the independent test set
| Method | Specificity | Recall | Precision | F1 | ΔF1 |
|---|---|---|---|---|---|
| Robetta | 0.87 | 0.33 | 0.52 | 0.40 | |
| FOLDEF | 0.88 | 0.26 | 0.48 | 0.34 | -0.06 |
| KFC | 0.85 | 0.31 | 0.48 | 0.37 | -0.03 |
| MINERVA | | 0.44 | | 0.52 | +0.12 |
| APIS (this work) | 0.76 | **0.72** | 0.57 | **0.64** | **+0.24** |

The highest value in each column is shown in bold. ΔF1 is the difference in F1 relative to Robetta.
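The ΔF1 entries are consistent with each method's F1 minus Robetta's F1 of 0.40. The short check below reproduces the three tabulated differences. It reads 0.52 as MINERVA's F1, an interpretation supported by its +0.12 entry, and the variable names are illustrative.

```python
baseline_f1 = 0.40  # Robetta, taken as the reference method

# F1 values as tabulated for the other methods.
f1_scores = {"FOLDEF": 0.34, "KFC": 0.37, "MINERVA": 0.52}

# Round to two decimals to match the table's precision.
delta_f1 = {
    method: round(f1 - baseline_f1, 2) for method, f1 in f1_scores.items()
}
```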
Figure 2. Visualization of prediction results for chain A (white) and chain E (blue) of protein complex 1CDL using (a) APIS, (b) KFC, and (c) MINERVA. The following color scheme is used: true positives (known hot spots predicted correctly) in red, true negatives (actual non-hot spots predicted correctly) in yellow, false positives (non-hot spots predicted as hot spots) in green, false negatives (known hot spots not predicted correctly) in purple. In this case, 9 of 12 residues are correctly predicted by our method.
Figure 3. Visualization of prediction results for chain A (white) of protein complex 1G3I (chain G not shown) using (a) APIS, (b) KFC, and (c) MINERVA. Red residues are actual hot spots predicted correctly; purple residues are actual hot spots not predicted correctly.