Siyu Liu, Chuyao Liu, Lei Deng.
Abstract
Hot spots are the subset of interface residues that account for most of the binding free energy, and they play essential roles in the stability of protein binding. Effectively identifying which specific interface residues of protein–protein complexes form the hot spots is critical for understanding the principles of protein interactions, and it has broad application prospects in protein design and drug development. Experimental methods like alanine scanning mutagenesis are labor-intensive and time-consuming, so the set of experimentally measured hot spots remains very limited. Hence, the use of computational approaches to predict hot spots is becoming increasingly important. Here, we describe the basic concepts and recent advances of machine learning applications in inferring protein–protein interaction hot spots, and assess the performance of widely used features, machine learning algorithms, and existing state-of-the-art approaches. We also discuss the challenges and future directions in the prediction of hot spots.
Keywords: hot spots; machine learning; performance evaluation; protein-protein interaction
Year: 2018 PMID: 30287797 PMCID: PMC6222875 DOI: 10.3390/molecules23102535
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. Overview of machine learning approaches to predicting protein–protein interaction hot spots. For the binding of interface residues in protein–protein interactions, a large number and variety of features are extracted from diverse data sources. Then, feature extraction and feature selection approaches are used for dimensionality reduction. Finally, the machine learning-based prediction models are trained and applied to make predictions of hot spots.
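As a minimal sketch of this pipeline on synthetic data (the estimator and selector names are assumptions from scikit-learn, not the exact tools used by the surveyed methods):

```python
# Sketch of the Figure 1 pipeline: features -> feature selection -> classifier.
# All data here are synthetic; the scikit-learn components are illustrative
# stand-ins for the feature-selection and training steps the review describes.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))       # 200 interface residues x 60 extracted features
y = rng.integers(0, 2, size=200)     # 1 = hot spot, 0 = non-hot spot (synthetic labels)

model = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),    # dimensionality reduction
    ("gtb", GradientBoostingClassifier(random_state=0)),
])
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(round(scores.mean(), 3))       # cross-validated AUC, as in the tables below
```

Swapping the final estimator for an SVM or a random forest reproduces the other model families compared in the tables.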
Summary of machine learning classification methods for protein–protein interaction hot spot prediction.
| Classification Methods | Description | References |
|---|---|---|
| Nearest neighbor | The model consists of 83 classifiers using the IBk algorithm, where instances are encoded by sequence properties. | Hu et al. |
| | The IBk classifier is trained on the training dataset to obtain several better random projections, which are then applied to the test dataset. | Jiang et al. |
| Support vector machine | A decision tree performs feature selection, and an SVM is applied to create the predictive model. | Cho et al. |
| | F-score filtering removes redundant and irrelevant features, and an SVM is used to train the model. | Xia et al. |
| | Proposed two new SVM-trained models of KFC. | Darnell et al. |
| | A two-step feature selection method selects 38 optimal features, and the SVM method is then used to build the prediction model. | Deng et al. |
| | The random forest algorithm selects the 58 optimal features, and the SVM algorithm is then used to train the model. | Ye et al. |
| | A two-step selection method selects the two best features, and the SVM algorithm is then used to build the classifier. | Xia et al. |
| | Remains effective even when the interface region is unknown. | Qian et al. |
| Decision trees | Formed by a combination of two decision tree models, K-FADE and K-CON. | Darnell et al. |
| Bayesian networks | Can handle some missing protein data, as well as unreliable conditions. | Assi et al. |
| Neural networks | Does not require knowledge of the interacting partner. | Ofran and Rost |
| Ensemble learning | The mRMR algorithm selects features, SMOTE handles the unbalanced data, and AdaBoost makes the final prediction. | Huang and Zhang |
| | Random forest (RF) is used to effectively integrate hybrid features. | Wang et al. |
| | Bootstrap resampling approaches and decision fusion techniques are used to train and integrate sub-classifiers. | Deng et al. |
Performance comparison of different features on the benchmark dataset (HB34).
| Methods | Features | SPE | SEN | PRE | ACC | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| SVM | Physicochemical | 0.672 | 0.521 | 0.545 | 0.608 | 0.520 | 0.196 | 0.566 |
| | PSSM | 0.696 | 0.504 | 0.553 | 0.614 | 0.515 | 0.204 | 0.634 |
| | Blocks substitution matrix | 0.644 | 0.522 | 0.529 | 0.594 | 0.511 | 0.170 | 0.595 |
| | ASA | 0.677 | 0.688 | 0.612 | 0.660 | 0.638 | 0.362 | 0.737 |
| | Solvent exposure | 0.609 | 0.726 | 0.580 | 0.658 | 0.635 | 0.339 | 0.724 |
| | Combined | 0.711 | 0.638 | 0.684 | 0.699 | 0.652 | 0.393 | 0.757 |
| RF | Physicochemical | 0.624 | 0.549 | 0.521 | 0.592 | 0.522 | 0.174 | 0.635 |
| | PSSM | 0.682 | 0.561 | 0.567 | 0.632 | 0.555 | 0.244 | 0.648 |
| | Blocks substitution matrix | 0.620 | 0.550 | 0.521 | 0.590 | 0.523 | 0.170 | 0.632 |
| | ASA | 0.722 | 0.587 | 0.614 | 0.664 | 0.589 | 0.312 | 0.696 |
| | Solvent exposure | 0.682 | 0.552 | 0.565 | 0.626 | 0.549 | 0.236 | 0.669 |
| | Combined | 0.756 | 0.656 | 0.624 | 0.699 | 0.631 | 0.384 | 0.766 |
| GTB | Physicochemical | 0.587 | 0.586 | 0.514 | 0.586 | 0.535 | 0.173 | 0.635 |
| | PSSM | 0.612 | 0.641 | 0.550 | 0.624 | 0.584 | 0.251 | 0.669 |
| | Blocks substitution matrix | 0.591 | 0.588 | 0.517 | 0.591 | 0.540 | 0.179 | 0.635 |
| | ASA | 0.665 | 0.648 | 0.588 | 0.658 | 0.608 | 0.310 | 0.693 |
| | Solvent exposure | 0.624 | 0.639 | 0.558 | 0.631 | 0.587 | 0.261 | 0.669 |
| | Combined | 0.717 | 0.656 | 0.727 | 0.719 | 0.681 | 0.439 | 0.787 |
Performance comparison of feature combinations on the benchmark dataset (HB34) using GTB.
| Methods | Features | SPE | SEN | PRE | ACC | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| GTB | ASA + PSSM | 0.708 | 0.705 | 0.642 | 0.707 | 0.663 | 0.410 | 0.761 |
| | PSSM + Solvent exposure | 0.671 | 0.718 | 0.617 | 0.691 | 0.656 | 0.385 | 0.760 |
| | Blosum62 + Solvent exposure | 0.664 | 0.699 | 0.606 | 0.679 | 0.640 | 0.359 | 0.734 |
| | ASA + Solvent exposure | 0.674 | 0.695 | 0.612 | 0.683 | 0.642 | 0.366 | 0.728 |
| | Phy + Solvent exposure | 0.664 | 0.696 | 0.605 | 0.677 | 0.639 | 0.357 | 0.728 |
| | ASA + Blosum62 | 0.658 | 0.651 | 0.585 | 0.656 | 0.608 | 0.307 | 0.718 |
| | ASA + Phy | 0.669 | 0.644 | 0.590 | 0.658 | 0.607 | 0.311 | 0.717 |
| | Phy + PSSM | 0.629 | 0.650 | 0.566 | 0.638 | 0.597 | 0.277 | 0.683 |
| | PSSM + Blosum62 | 0.619 | 0.655 | 0.560 | 0.635 | 0.595 | 0.271 | 0.679 |
| | Phy + Blosum62 | 0.593 | 0.590 | 0.520 | 0.592 | 0.541 | 0.183 | 0.639 |
| | Combined (all features) | 0.717 | 0.656 | 0.727 | 0.719 | 0.681 | 0.439 | 0.787 |
Performance comparison of different features on the independent test dataset (BID18).
| Methods | Features | SPE | SEN | PRE | ACC | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| SVM | Physicochemical | 0.577 | 0.393 | 0.597 | 0.583 | 0.472 | 0.162 | 0.634 |
| | PSSM | 0.675 | 0.438 | 0.561 | 0.640 | 0.491 | 0.223 | 0.663 |
| | Blocks substitution matrix | 0.626 | 0.435 | 0.632 | 0.628 | 0.512 | 0.242 | 0.661 |
| | ASA | 0.597 | 0.446 | 0.716 | 0.634 | 0.549 | 0.290 | 0.693 |
| | Solvent exposure | 0.642 | 0.403 | 0.532 | 0.608 | 0.456 | 0.167 | 0.617 |
| | Combined | 0.569 | 0.464 | 0.832 | 0.650 | 0.586 | 0.353 | 0.732 |
| RF | Physicochemical | 0.632 | 0.414 | 0.576 | 0.614 | 0.479 | 0.196 | 0.624 |
| | PSSM | 0.703 | 0.417 | 0.474 | 0.632 | 0.443 | 0.171 | 0.616 |
| | Blocks substitution matrix | 0.620 | 0.408 | 0.575 | 0.607 | 0.474 | 0.185 | 0.627 |
| | ASA | 0.604 | 0.437 | 0.686 | 0.629 | 0.534 | 0.268 | 0.679 |
| | Solvent exposure | 0.590 | 0.402 | 0.612 | 0.597 | 0.484 | 0.188 | 0.640 |
| | Combined | 0.612 | 0.466 | 0.753 | 0.656 | 0.575 | 0.338 | 0.758 |
| GTB | Physicochemical | 0.531 | 0.384 | 0.643 | 0.566 | 0.478 | 0.163 | 0.625 |
| | PSSM | 0.681 | 0.416 | 0.506 | 0.627 | 0.456 | 0.178 | 0.638 |
| | Blocks substitution matrix | 0.580 | 0.400 | 0.617 | 0.592 | 0.480 | 0.184 | 0.624 |
| | ASA | 0.585 | 0.437 | 0.718 | 0.626 | 0.543 | 0.280 | 0.679 |
| | Solvent exposure | 0.592 | 0.389 | 0.579 | 0.588 | 0.465 | 0.159 | 0.646 |
| | Combined | 0.621 | 0.476 | 0.766 | 0.666 | 0.597 | 0.378 | 0.769 |
Figure 2. A Venn diagram showing the number of correctly predicted residues from the three machine learning algorithms for the independent dataset (BID18).
Detailed prediction results for each protein on the independent test dataset (BID18).
| PDB ID | GTB TP | GTB FP | GTB TN | GTB FN | RF TP | RF FP | RF TN | RF FN | SVM TP | SVM FP | SVM TN | SVM FN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1CDL_A | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 1CDL_E | 5 | 3 | 1 | 0 | 5 | 1 | 3 | 0 | 5 | 3 | 1 | 0 |
| 1DVA_H | 0 | 4 | 7 | 1 | 0 | 4 | 7 | 1 | 0 | 4 | 7 | 1 |
| 1DVA_X | 3 | 3 | 4 | 1 | 4 | 2 | 5 | 0 | 4 | 3 | 4 | 0 |
| 1DX5_N | 1 | 1 | 13 | 2 | 1 | 2 | 12 | 2 | 2 | 3 | 12 | 0 |
| 1EBP_A | 3 | 0 | 1 | 0 | 3 | 0 | 1 | 0 | 3 | 0 | 1 | 0 |
| 1EBP_C | 1 | 3 | 1 | 0 | 1 | 1 | 3 | 0 | 1 | 0 | 4 | 0 |
| 1ES7_A | 1 | 3 | 0 | 0 | 0 | 3 | 0 | 1 | 1 | 3 | 0 | 0 |
| 1FAK_T | 2 | 5 | 14 | 0 | 2 | 5 | 14 | 0 | 2 | 7 | 12 | 0 |
| 1FE8_A | 0 | 3 | 1 | 0 | 0 | 3 | 1 | 0 | 0 | 3 | 1 | 0 |
| 1FOE_B | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| 1G3I_A | 6 | 0 | 0 | 0 | 5 | 0 | 0 | 1 | 6 | 0 | 0 | 0 |
| 1GL4_A | 4 | 1 | 1 | 1 | 3 | 2 | 0 | 2 | 3 | 1 | 1 | 2 |
| 1IHB_B | 0 | 2 | 2 | 0 | 0 | 2 | 2 | 0 | 0 | 2 | 2 | 0 |
| 1JAT_A | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1JAT_B | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1JPP_B | 0 | 2 | 3 | 2 | 1 | 3 | 2 | 1 | 2 | 5 | 0 | 0 |
| 1MQ8_B | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1NFI_F | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1NUN_A | 0 | 2 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 2 | 1 | 0 |
| 1UB4_C | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2HHB_B | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
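The per-protein confusion counts above can be converted into the summary metrics used throughout the tables (SPE, SEN, PRE, ACC, F1, MCC). A minimal sketch using the standard definitions of these metrics:

```python
import math

def metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    spe = tn / (tn + fp) if tn + fp else 0.0      # specificity
    sen = tp / (tp + fn) if tp + fn else 0.0      # sensitivity (recall)
    pre = tp / (tp + fp) if tp + fp else 0.0      # precision
    acc = (tp + tn) / (tp + fp + tn + fn)         # accuracy
    f1 = 2 * pre * sen / (pre + sen) if pre + sen else 0.0
    d = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / d if d else 0.0   # Matthews correlation coefficient
    return spe, sen, pre, acc, f1, mcc

# Example: 1CDL_E under GTB (TP=5, FP=3, TN=1, FN=0 from the table above).
print([round(v, 3) for v in metrics(5, 3, 1, 0)])
# -> [0.25, 1.0, 0.625, 0.667, 0.769, 0.395]
```

The dataset-level values reported in the comparison tables aggregate these counts over all proteins before applying the same formulas.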
Performance comparison of existing approaches on the independent test dataset (BID18).
| Methods | Classifier | SPE | SEN | PRE | ACC | F1 | MCC |
|---|---|---|---|---|---|---|---|
| HEP | SVM | 0.76 | 0.60 | 0.84 | 0.79 | 0.70 | 0.56 |
| PredHS-SVM | SVM | 0.93 | 0.79 | 0.59 | 0.83 | 0.68 | 0.57 |
| iPPHOT | SVM | 0.586 | 0.462 | 0.794 | 0.650 | 0.584 | 0.353 |
| KFC2a | SVM | 0.73 | 0.55 | 0.74 | 0.73 | 0.63 | 0.44 |
| KFC2b | SVM | 0.87 | 0.64 | 0.55 | 0.77 | 0.60 | 0.44 |
| PCRPi | Bayesian network | 0.75 | 0.51 | 0.39 | 0.69 | 0.44 | 0.25 |
| MINERVA | SVM | 0.90 | 0.65 | 0.44 | 0.76 | 0.52 | 0.38 |
| APIS | SVM | 0.76 | 0.57 | 0.72 | 0.75 | 0.64 | 0.45 |
| KFC | Decision trees | 0.85 | 0.48 | 0.31 | 0.69 | 0.38 | 0.19 |
| Robetta | Knowledge-based method | 0.88 | 0.52 | 0.33 | 0.72 | 0.41 | 0.25 |
| FOLDEF | Knowledge-based method | 0.88 | 0.48 | 0.26 | 0.69 | 0.34 | 0.17 |