Lei Deng, Yuanchao Sui, Jingpu Zhang.
Abstract
Hot spot residues at protein–RNA complexes are vitally important for investigating the underlying molecular recognition mechanism. Accurately identifying protein–RNA binding hot spots is critical for drug design and protein engineering. Although some progress has been made by utilizing various available features and a series of machine learning approaches, these methods are still in their infancy. In this paper, we present a new computational method named XGBPRH, which is based on an eXtreme Gradient Boosting (XGBoost) algorithm and can effectively predict hot spot residues in protein–RNA interfaces utilizing an optimal set of properties. Firstly, we download 47 protein–RNA complexes and calculate a total of 156 sequence, structure, exposure, and network features. Next, we adopt a two-step feature selection algorithm to select 6 optimal features from these 156 features. Compared with the state-of-the-art approaches, XGBPRH achieves better performance with an area under the ROC curve (AUC) score of 0.817 and an F1-score of 0.802 on the independent test set. Meanwhile, we also apply XGBPRH to two case studies. The results demonstrate that the method can effectively identify novel energy hotspots.
Keywords: XGBoost; hot spots; protein–RNA interfaces; two-step feature selection
Year: 2019 PMID: 30901953 PMCID: PMC6471955 DOI: 10.3390/genes10030242
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Flowchart of the XGBPRH method. The experimental dataset of 47 protein–RNA complexes comes from Pan et al.’s work [18]. We extracted 156 network, exposure, sequence, and structure features. We then adopted the McTwo feature selection algorithm to select the optimal features and used the selected optimal features to train the eXtreme Gradient Boosting (XGBoost) classifier. Finally, we evaluated the performance on the training dataset and independent dataset.
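The flowchart above can be sketched end to end as feature matrix → feature selection → boosted-tree classifier → evaluation. The sketch below uses synthetic data and scikit-learn's `GradientBoostingClassifier` together with a simple univariate filter as stand-ins; the actual method trains the `xgboost` package on 6 features chosen by the two-step (McTwo) procedure.

```python
# Minimal sketch of an XGBPRH-style pipeline on synthetic data.
# Stand-ins: SelectKBest replaces the two-step McTwo selection,
# GradientBoostingClassifier replaces XGBoost.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for 156 residue features (hot spot = 1, non-hot spot = 0)
X, y = make_classification(n_samples=300, n_features=156,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: reduce 156 features to a small optimal subset (the paper keeps 6)
selector = SelectKBest(f_classif, k=6).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

# Step 2: train a gradient-boosted tree classifier on the selected features
clf = GradientBoostingClassifier(random_state=0).fit(X_tr_sel, y_tr)

# Step 3: evaluate on the held-out split, as done on the independent dataset
auc = roc_auc_score(y_te, clf.predict_proba(X_te_sel)[:, 1])
```

Swapping in `xgboost.XGBClassifier` (where installed) requires no other change, since it follows the same scikit-learn fit/predict interface.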
The dataset of 47 protein–RNA complexes (Protein Data Bank [PDB] codes).
| Dataset | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Training dataset | 1ASY | 1B23 | 1JBS | 1U0B | 1URN | 1YVP | 2BX2 | 2IX1 |
| | 2M8D | 2PJP | 2Y8W | 2ZI0 | 2ZKO | 2ZZN | 3EQT | 3K5Q |
| | 3L25 | 3MOJ | 3OL6 | 3VYX | 4ERD | 4MDX | 4NGB | 4NKU |
| | 4OOG | 4PMW | 4QVC | 4YVI | 5AWH | 5DNO | 5IP2 | 5UDZ |
| Independent testing dataset | 1FEU | 1WNE | 1ZDI | 2KXN | 2XB2 | 3AM1 | 3UZS | 3VYY |
| | 4CIO | 4GOA | 4JVH | 4NL3 | 5EN1 | 5EV1 | 5HO4 | |
Figure 2. The R values of the top 26 features.
The performance of the McTWO feature selection algorithm in comparison with four other feature selection algorithms.
| Method | ACC | SENS | SPEC | PRE | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|
| Boruta | 0.65 | 0.603 | 0.733 | 0.733 | 0.634 | 0.337 | 0.730 |
| mRMR | 0.667 | 0.661 | 0.663 | 0.726 | 0.662 | 0.347 | 0.760 |
| RFE | 0.692 | 0.671 | 0.702 | 0.725 | 0.678 | 0.366 | 0.768 |
| RF | 0.708 | 0.698 | 0.727 | 0.767 | 0.711 | 0.435 | 0.821 |
| Two-step | 0.733 | 0.732 | 0.770 | 0.797 | 0.743 | 0.505 | 0.889 |
mRMR: Minimum redundancy maximum relevance, RFE: Recursive feature elimination, RF: Random forest, ACC: Accuracy, SENS: Sensitivity, SPEC: Specificity, PRE: Precision, F1: F1-score, MCC: Matthews correlation coefficient, AUC: Area under the ROC curve.
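All of the tabulated metrics except AUC follow directly from the confusion matrix (TP, TN, FP, FN); AUC additionally needs the continuous prediction scores. The sketch below recomputes them on small illustrative labels (not the paper's data) using their standard definitions.

```python
# Computing ACC, SENS, SPEC, PRE, F1, MCC, and AUC from illustrative predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

y_true  = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])   # ground-truth hot spot labels
y_pred  = np.array([1, 1, 0, 0, 0, 1, 1, 0, 1, 0])   # thresholded predictions
y_score = np.array([0.9, 0.8, 0.4, 0.2, 0.3, 0.6, 0.7, 0.1, 0.85, 0.35])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc  = (tp + tn) / (tp + tn + fp + fn)   # ACC: overall accuracy
sens = tp / (tp + fn)                    # SENS: sensitivity (recall)
spec = tn / (tn + fp)                    # SPEC: specificity
pre  = tp / (tp + fp)                    # PRE: precision
f1   = 2 * pre * sens / (pre + sens)     # F1: harmonic mean of PRE and SENS
mcc  = matthews_corrcoef(y_true, y_pred) # MCC: balanced even for skewed classes
auc  = roc_auc_score(y_true, y_score)    # AUC: uses scores, not hard labels
```

MCC and AUC are the most informative columns here because hot spots are a minority class, where raw accuracy can be misleading.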
Figure 3. Ranking of feature importance for the six optimal features in terms of F-score.
The F-score of the six optimal features using XGBoost with 10-fold cross-validation over 50 trials.
| Rank | Feature Name | Symbol | F-Score |
|---|---|---|---|
| 1 | RDa | RDa | 0.693 |
| 2 | Closeness | Closeness | 0.679 |
| 3 | Eccentricity | Eccentricity | 0.675 |
| 4 | Enrich conservation | Enrich_conserv | 0.634 |
| 5 | Half-sphere exposure (down) | HSEBD | 0.588 |
| 6 | ASA (relative total_side) | ASA_rts | 0.587 |
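The "10-fold cross-validation over 50 trials" protocol behind the table can be sketched as repeated stratified 10-fold CV with a different shuffle per trial, averaging the per-trial scores. Synthetic data again stands in for the six residue features, and scikit-learn's `GradientBoostingClassifier` for XGBoost; the trial count is cut to 5 to keep the sketch quick.

```python
# Repeated stratified 10-fold cross-validation, averaged over trials.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the 6 selected features
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

scores = []
for trial in range(5):  # the paper averages over 50 trials
    # A new random_state per trial reshuffles the fold assignment
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=trial)
    scores.append(cross_val_score(GradientBoostingClassifier(random_state=0),
                                  X, y, cv=cv, scoring="f1").mean())

mean_f1 = float(np.mean(scores))  # the averaged F-score reported per feature set
```

Reshuffling the folds each trial averages out the variance that a single arbitrary 10-fold partition would introduce, which matters on a dataset this small.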
Performance comparison of different machine learning methods.
| Method | ACC | SENS | SPEC | PRE | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|
| RF | 0.710 | 0.650 | 0.781 | 0.779 | 0.690 | 0.430 | 0.783 |
| SVM | 0.741 | 0.738 | 0.741 | 0.775 | 0.741 | 0.480 | 0.802 |
| GTB | 0.740 | 0.728 | 0.755 | 0.784 | 0.739 | 0.481 | 0.810 |
| XGBoost | 0.744 | 0.740 | 0.755 | 0.785 | 0.744 | 0.494 | 0.822 |
SVM: Support vector machine, GTB: Gradient tree boosting.
Prediction performance of XGBPRH in comparison with PrabHot and HotSPRing on the independent dataset.
| Method | SENS | SPEC | PRE | F1 | MCC | AUC |
|---|---|---|---|---|---|---|
| XGBPRH | 0.909 | 0.733 | 0.833 | 0.870 | 0.661 | 0.868 |
| XGBPRH-50 | 0.880 | 0.537 | 0.739 | 0.802 | 0.454 | 0.817 |
| PrabHot | 0.793 | 0.655 | 0.697 | 0.742 | 0.453 | 0.804 |
| PrabHot-50 | 0.695 | 0.690 | 0.703 | 0.733 | 0.389 | 0.771 |
| HotSPRing | 0.655 | 0.552 | 0.604 | 0.633 | 0.258 | 0.658 |
PrabHot: Prediction of protein–RNA binding hot spots, “-50”: 50 repetitions’ average performance of the proposed method.
Figure 4The ROC curves (receiver operating characteristic curve) of the three approaches on the independent test dataset.
Figure 5The prediction results on 4JVH using XGBPRH method. True positives colored in purple.
Figure 6The prediction results on 1FEU using XGBPRH. True positives are labeled in purple, true negatives are labeled in yellow, and false negatives are labeled in orange.