Hao Wang, Chuyao Liu, Lei Deng.
Abstract
Hot spots are a small portion of protein-protein interface residues that contribute the majority of the binding free energy. Identifying them can provide crucial information for understanding the function of proteins and studying their interactions. Building on our previous method (PredHS), we propose a new computational approach, PredHS2, that further improves the accuracy of predicting hot spots at protein-protein interfaces. First, we build a new training dataset of 313 alanine-mutated interface residues extracted from 34 protein complexes. Next, we generate 600 features covering sequence, structure, exposure and energy, together with Euclidean and Voronoi neighborhood properties. To remove redundant and irrelevant information, we select an optimal set of 26 features using a two-step feature selection method, which consists of a minimum Redundancy Maximum Relevance (mRMR) procedure and a sequential forward selection process. Based on the selected 26 features, we use Extreme Gradient Boosting (XGBoost) to build our prediction model. PredHS2 outperforms other machine learning algorithms on the training dataset and other state-of-the-art hot spot prediction methods on the independent test set (BID). Several novel features, such as solvent exposure characteristics, secondary structure features and disorder scores, are found to be more effective in discriminating hot spots. Moreover, the updated training dataset and the new feature selection and classification algorithms play a vital role in improving the prediction quality.
Year: 2018 PMID: 30250210 PMCID: PMC6155324 DOI: 10.1038/s41598-018-32511-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Flowchart of PredHS2. First, the training dataset is generated by integrating four datasets: ASEdb, SKEMPI, Ab+ and Alexov_sDB. The independent test dataset is extracted from the BID database. The residues in the datasets are encoded using a large number of sequence, structure, energy and exposure features and two categories of structural neighborhood properties (Euclidean and Voronoi). As a result, a total of 200 site features, 200 Euclidean features and 200 Voronoi features are obtained. A two-step feature selection approach is then applied to select the optimal feature set. Finally, the prediction classifier is built using Extreme Gradient Boosting on the optimal feature set.
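As a rough illustration of how a neighborhood feature can be derived from site features, the sketch below sums the site features of all residues lying within a distance cutoff of each residue. Everything specific here is an assumption made for illustration and is not taken from this record: the function name `euclidean_neighborhood_features`, the use of one representative coordinate per residue, the exclusion of the central residue, the plain summation, and the placeholder cutoff of 5.0.

```python
import math

def euclidean_neighborhood_features(coords, site_features, cutoff=5.0):
    """Sum the site features of every other residue within `cutoff`.

    coords        : list of (x, y, z) tuples, one per residue (assumed layout)
    site_features : list of equal-length feature lists, one per residue
    cutoff        : distance threshold; 5.0 is a placeholder value
    """
    n = len(coords)
    width = len(site_features[0])
    result = []
    for i in range(n):
        total = [0.0] * width
        for j in range(n):
            # Skip the central residue itself (an assumption of this sketch).
            if i != j and math.dist(coords[i], coords[j]) <= cutoff:
                for k in range(width):
                    total[k] += site_features[j][k]
        result.append(total)
    return result

# Toy example: three residues on a line, two site features each.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
toy_site = [[1.0, 2.0], [0.5, 0.5], [4.0, 4.0]]
toy_neighborhood = euclidean_neighborhood_features(toy_coords, toy_site)
```

With 200 site features per residue, a construction of this kind would yield 200 neighborhood features per residue, which is consistent with the feature counts given in the caption. A Voronoi neighborhood would replace the distance test with a shared-cell-face test and is not shown here.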
Figure 2. Performance of the two-step feature selection. (a) R scores of the top-K features; (b) F1 and MCC scores of the top-K features.
The performance of the two-step feature selection method in comparison with other feature selection methods.
| Method | ACC | SPE | PRE | SEN | F1 | MCC |
|---|---|---|---|---|---|---|
| All features | 0.753 | 0.806 | 0.721 | 0.677 | 0.689 | 0.487 |
| RF | 0.808 | 0.862 | 0.799 | 0.722 | 0.756 | 0.598 |
| RFE | 0.811 | 0.846 | 0.809 | 0.769 | 0.774 | 0.626 |
| mRMR | 0.794 | 0.826 | 0.769 | 0.763 | 0.757 | 0.588 |
| Two-step | 0.818 | 0.844 | 0.786 | 0.783 | 0.782 | 0.630 |
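The two-step procedure named in the abstract (an mRMR ranking plus sequential forward selection) can be sketched roughly as below. This is a toy illustration under stated assumptions, not the authors' implementation: absolute Pearson correlation stands in for mutual information, the function names `pearson`, `mrmr_rank`, `forward_select` and `toy_evaluate` are hypothetical, and `toy_evaluate` is a placeholder for whatever cross-validated score would be used in practice.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def mrmr_rank(columns, labels, k):
    """Step 1: greedy mRMR-style ranking (relevance minus mean redundancy).

    |Pearson r| is used in place of mutual information (an assumption).
    """
    relevance = {name: abs(pearson(col, labels)) for name, col in columns.items()}
    ranked, remaining = [], set(columns)

    def score(name):
        if not ranked:
            return relevance[name]
        redundancy = sum(abs(pearson(columns[name], columns[s])) for s in ranked)
        return relevance[name] - redundancy / len(ranked)

    while remaining and len(ranked) < k:
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        ranked.append(best)
        remaining.remove(best)
    return ranked

def forward_select(candidates, evaluate):
    """Step 2: add the candidate that most improves evaluate(); stop when none does."""
    selected, best_score, pool = [], float("-inf"), list(candidates)
    while pool:
        top_score, top = max(((evaluate(selected + [c]), c) for c in pool),
                             key=lambda t: t[0])
        if top_score <= best_score:
            break
        selected.append(top)
        pool.remove(top)
        best_score = top_score
    return selected, best_score

# Toy data: 8 residues, binary features, 1 = hot spot.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "f_a":     [1, 1, 1, 0, 0, 0, 0, 0],
    "f_a_dup": [1, 1, 1, 0, 0, 0, 0, 0],  # redundant copy of f_a
    "f_b":     [1, 0, 0, 1, 0, 0, 0, 0],
    "f_noise": [1, 0, 0, 1, 1, 0, 0, 1],
}

def toy_evaluate(subset):
    """Placeholder score: accuracy of predicting 1 when any selected feature is 1."""
    preds = [int(any(columns[f][i] for f in subset)) for i in range(len(labels))]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

ranked = mrmr_rank(columns, labels, k=3)
selected, best = forward_select(ranked, toy_evaluate)
```

On this toy data the redundancy penalty ranks `f_b` ahead of the duplicate `f_a_dup`, and forward selection stops once adding a further feature no longer raises the score.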
Figure 3. Feature importance of the 26 selected features.
The optimal 26 features for identifying hot spots based on the two-step feature selection method.
| Rank | Feature name | Symbol | F-score | Feature type |
|---|---|---|---|---|
| 1 | Weighted Solvent exposure features (HSEAU) | W_HSEAU | 0.9346 | Site |
| 2 | Weighted Solvent exposure features (HSEBU) in Euclidean neighborhood | W_HSEBU_EN | 0.7007 | Euclidean |
| 3 | Weighted normalized residue contacts in complex in Euclidean neighborhood | W_Ncrc_EN | 0.6894 | Euclidean |
| 4 | Weighted Side-chain environment (pKa_1) | W_Pka1 | 0.6737 | Site |
| 5 | Weighted Disorder_6 score in Voronoi neighborhood | W_Disorder6_VN | 0.6546 | Voronoi |
| 6 | Δ(delta) normalized residue contacts | Delncr | 0.6086 | Site |
| 7 | Pair potentials in monomer | Ppm | 0.3576 | Site |
| 8 | Weighted Blosum (A) in Voronoi neighborhood | W_BlosumA_VN | 0.2991 | Voronoi |
| 9 | Weighted Blosum (T) | W_BlosumT | 0.2258 | Site |
| 10 | Weighted Sidechain energy score | W_SCE1 | 0.1991 | Voronoi |
| 11 | Side chain energy score (SCE-score (conserv)) | SCE4 | 0.1842 | Site |
| 12 | Weighted Secondary Structure (SS) helix in Voronoi neighborhood | W_SS1_VN | 0.1716 | Voronoi |
| 13 | Secondary Structure (SS) coil in Voronoi neighborhood | SS3_VN | 0.1675 | Voronoi |
| 14 | Weighted Disorder_4 score | W_Disorder4 | 0.1502 | Site |
| 15 | SCE-score (conbine_1) in Euclidean neighborhood | SCE5_EN | 0.13108 | Euclidean |
| 16 | PSSM (Q) | PssmQ | 0.1261 | Site |
| 17 | Hydrogen bonds in Euclidean neighborhood | Hb_EN | 0.1163 | Euclidean |
| 18 | Weighted PSSM (H) | W_PssmH | 0.1117 | Site |
| 19 | Blosum (W) | BlosumW | 0.1078 | Site |
| 20 | Weighted Disorder_5 score | W_Disorder5 | 0.0502 | Site |
| 21 | Weighted SCE-score (conbine_1) in Euclidean neighborhood | W_SCE5_EN | 0.02217 | Euclidean |
| 22 | Disorder_6 score | Disorder6 | 0.01865 | Site |
| 23 | Weighted PSSM (C) in Voronoi neighborhood | W_PssmC_VN | 0.01427 | Voronoi |
| 24 | Physicochemical properties (polarity) in Euclidean neighborhood | polarity_EN | 0.001 | Euclidean |
| 25 | PSSM (V) in Voronoi neighborhood | PssmV_VN | 0.00069 | Voronoi |
| 26 | Blosum (F) in Voronoi neighborhood | BlosumF_VN | 0.00018 | Voronoi |
Figure 4. Box plots of W_HSEAU (A), W_HSEBU_EN (B) and W_Ncrc_EN (C) for hot spots and non-hot spots in the training dataset, and of W_HSEAU (D), W_HSEBU_EN (E) and W_Ncrc_EN (F) in the test dataset. In each box plot, the bottom and top of the box are the lower and upper quartiles, respectively, and the middle line is the median.
Comparison with other machine learning methods on the training dataset with 10-fold cross-validation.
| Method | ACC | SPE | PRE | SEN | F1 | MCC |
|---|---|---|---|---|---|---|
| RF | 0.700 | 0.827 | 0.695 | 0.528 | 0.597 | 0.377 |
| SVM | 0.702 | 0.789 | 0.674 | 0.587 | 0.621 | 0.388 |
| GTB | 0.761 | 0.800 | 0.717 | 0.709 | 0.709 | 0.510 |
| MLP | 0.648 | 0.655 | 0.603 | 0.640 | 0.600 | 0.306 |
| PredHS2 | 0.818 | 0.844 | 0.786 | 0.783 | 0.782 | 0.630 |
Performance comparison of PredHS2 and other existing methods on the independent test dataset.
| Method | TP | TN | FP | FN | ACC | SPE | PRE | SEN | F1 | MCC |
|---|---|---|---|---|---|---|---|---|---|---|
| PredHS2 | 30 | 80 | 7 | 9 | 0.87 | 0.92 | 0.81 | 0.77 | 0.79 | 0.70 |
| iPPHOT | 31 | 51 | 36 | 8 | 0.65 | 0.59 | 0.46 | 0.79 | 0.58 | 0.35 |
| HEP | 32 | 68 | 21 | 6 | 0.79 | 0.76 | 0.60 | 0.84 | 0.70 | 0.56 |
| PredHS-SVM | 23 | 81 | 6 | 16 | 0.83 | 0.93 | 0.79 | 0.59 | 0.68 | 0.57 |
| APIS | 28 | 67 | 21 | 11 | 0.75 | 0.76 | 0.57 | 0.72 | 0.64 | 0.45 |
| Robetta | 12 | 80 | 11 | 24 | 0.72 | 0.88 | 0.52 | 0.33 | 0.41 | 0.25 |
| FOLDEF | 10 | 78 | 11 | 28 | 0.69 | 0.88 | 0.48 | 0.26 | 0.34 | 0.17 |
| KFC | 12 | 75 | 12 | 27 | 0.69 | 0.85 | 0.48 | 0.31 | 0.38 | 0.19 |
| MINERVA | 17 | 79 | 9 | 22 | 0.76 | 0.90 | 0.65 | 0.44 | 0.52 | 0.38 |
| KFC2a | 29 | 64 | 24 | 10 | 0.73 | 0.73 | 0.55 | 0.74 | 0.63 | 0.44 |
| KFC2b | 21 | 77 | 12 | 17 | 0.77 | 0.87 | 0.65 | 0.55 | 0.60 | 0.44 |
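The metric columns in the table above can be recomputed from the TP, TN, FP and FN counts using the standard confusion-matrix definitions. The sketch below does this for the PredHS2 row; the function name `confusion_metrics` and the dictionary layout are illustrative choices, not part of the source.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # accuracy
    spe = tn / (tn + fp)                    # specificity
    pre = tp / (tp + fp)                    # precision
    sen = tp / (tp + fn)                    # sensitivity (recall)
    f1 = 2 * pre * sen / (pre + sen)        # harmonic mean of PRE and SEN
    mcc = (tp * tn - fp * fn) / math.sqrt(  # Matthews correlation coefficient
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"ACC": acc, "SPE": spe, "PRE": pre, "SEN": sen, "F1": f1, "MCC": mcc}

# Counts taken from the PredHS2 row of the table.
predhs2 = confusion_metrics(tp=30, tn=80, fp=7, fn=9)
for name, value in predhs2.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these values reproduce the PredHS2 row (ACC 0.87, SPE 0.92, PRE 0.81, SEN 0.77, F1 0.79, MCC 0.70).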
Figure 5. Comparison of PredHS2, iPPHOT and PredHS-SVM on the independent test dataset. (A) ROC curves; (B) precision-recall curves.
Figure 6. Hot spot predictions by PredHS2 (A) and iPPHOT (B) for the EPO receptor complex. Residues are colored as true positives (red), true negatives (yellow), false positives (green) and false negatives (purple). Chain A is shown in cyan and chain C in blue.