Jun Hu, Xue He, Dong-Jun Yu, Xi-Bei Yang, Jing-Yu Yang, Hong-Bin Shen.
Abstract
Protein-nucleotide interactions are ubiquitous in a wide variety of biological processes. Accurately identifying interaction residues solely from protein sequences is useful for both protein function annotation and drug design, especially in the post-genomic era, as large volumes of protein data have not been functionally annotated. Protein-nucleotide binding residue prediction is a typical imbalanced learning problem, in which binding residues are far fewer than non-binding residues. Alleviating the severity of class imbalance has been shown to be a promising way to improve the performance of machine-learning-based predictors on class imbalance problems. However, little attention has been paid to the negative impact of class imbalance on protein-nucleotide binding residue prediction. In this study, we propose a new supervised over-sampling (SOS) algorithm that synthesizes additional minority class samples to address class imbalance. Experimental results on protein-nucleotide interaction datasets demonstrate that the proposed supervised over-sampling algorithm can relieve the severity of class imbalance and help to improve prediction performance. Based on the proposed over-sampling algorithm, a predictor called TargetSOS is implemented for protein-nucleotide binding residue prediction. Cross-validation tests and independent validation tests demonstrate the effectiveness of TargetSOS. The web server and the datasets used in this study are freely available at http://www.csbio.sjtu.edu.cn/bioinf/TargetSOS/.
Year: 2014 PMID: 25229688 PMCID: PMC4168127 DOI: 10.1371/journal.pone.0107676
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Compositions of the two benchmark datasets.
| Dataset | Ligand Type | Cross-Validation (Training): No. of Sequences | Cross-Validation (Training): (numP, numN)* | Cross-Validation (Training): Ratio△ | Independent Validation: No. of Sequences | Independent Validation: (numP, numN)* | Independent Validation: Ratio△ | Total No. of Sequences |
| ATP168 | ATP | 168 | (3104, 59226) | 19 | – | – | – | 168 |
| NUC5 | ATP | 227 | (3393, 80409) | 24 | 17 | (248, 6974) | 28 | 244 |
| NUC5 | ADP | 321 | (4688, 121158) | 26 | 26 | (405, 10553) | 26 | 347 |
| NUC5 | AMP | 140 | (1756, 44009) | 25 | 20 | (263, 6057) | 23 | 160 |
| NUC5 | GTP | 56 | (875, 21401) | 24 | 7 | (134, 2678) | 20 | 63 |
| NUC5 | GDP | 105 | (1577, 36561) | 23 | 7 | (94, 2420) | 26 | 112 |
* In each 2-tuple (numP, numN), numP and numN are the numbers of positive (binding residue) and negative (non-binding residue) samples, respectively; △ Ratio = numN/numP.
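The Ratio column follows directly from the footnote's definition. As a quick check, the short sketch below recomputes it from the training (numP, numN) pairs listed in the table; the dictionary keys and the function name `imbalance_ratio` are illustrative choices, not names from the paper.

```python
# (numP, numN) pairs copied from the cross-validation (training) columns above.
# Keys are (dataset, ligand) labels chosen here for illustration only.
training_counts = {
    ("ATP168", "ATP"): (3104, 59226),
    ("NUC5", "ATP"): (3393, 80409),
    ("NUC5", "ADP"): (4688, 121158),
    ("NUC5", "AMP"): (1756, 44009),
    ("NUC5", "GTP"): (875, 21401),
    ("NUC5", "GDP"): (1577, 36561),
}


def imbalance_ratio(num_pos: int, num_neg: int) -> float:
    """Return numN / numP, the number of negatives per positive sample."""
    if num_pos <= 0:
        raise ValueError("num_pos must be positive")
    return num_neg / num_pos


# Round to the nearest integer, as the table reports whole numbers.
ratios = {
    key: round(imbalance_ratio(p, n)) for key, (p, n) in training_counts.items()
}

for (dataset, ligand), ratio in ratios.items():
    print(f"{dataset:7s} {ligand}: about {ratio} non-binding residues per binding residue")
```

Every training set has roughly 19 to 26 non-binding residues for each binding residue, which is the imbalance the over-sampling step is meant to reduce.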
Performance comparisons of with-SOS and without-SOS predictions for ATP168 and ATP227 over five-fold cross-validation under Balanced Evaluation.
| Dataset | Over-Sampling | Sen (%) | Spe (%) | Acc (%) | MCC | AUC |
| ATP168 | with-SOS | | | | | |
| ATP168 | without-SOS | 75.2 | 77.2 | 77.1 | 0.262 | 0.843 |
| ATP227 | with-SOS | | | | | |
| ATP227 | without-SOS | 79.0 | 79.1 | 79.1 | 0.266 | 0.871 |
Performance comparisons of with-SOS and without-SOS predictions for ATP168 and ATP227 over five-fold cross-validation under MaxMCC Evaluation.
| Dataset | Over-Sampling | Sen (%) | Spe (%) | Acc (%) | MCC | AUC |
| ATP168 | with-SOS | | | | | |
| ATP168 | without-SOS | 35.2 | 98.5 | 95.3 | 0.415 | 0.843 |
| ATP227 | with-SOS | | | | | |
| ATP227 | without-SOS | 40.1 | 98.9 | 96.5 | 0.473 | 0.871 |
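The two evaluation settings named in the captions above differ only in how the decision threshold is chosen. The sketch below computes sensitivity (Sen), specificity (Spe), accuracy (Acc), and the Matthews correlation coefficient (MCC) from a confusion matrix, then picks a threshold in each mode. It assumes that MaxMCC Evaluation means the threshold that maximizes MCC and that Balanced Evaluation means the threshold at which Sen and Spe are closest (the near-equal Sen and Spe values reported under Balanced Evaluation suggest this, but the excerpt does not define it). The function names and the toy labels and scores are hypothetical.

```python
import math


def confusion(labels, scores, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y and predicted_positive:
            tp += 1
        elif y:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Return Sen, Spe, Acc, and MCC; undefined ratios fall back to 0.0."""
    sen = tp / (tp + fn) if tp + fn else 0.0
    spe = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spe": spe, "Acc": acc, "MCC": mcc}


def pick_threshold(labels, scores, mode):
    """Choose a threshold among the observed scores.

    mode="maxmcc":   maximize MCC.
    mode="balanced": minimize |Sen - Spe| (an assumed reading of 'Balanced').
    """
    def quality(t):
        m = metrics(*confusion(labels, scores, t))
        return m["MCC"] if mode == "maxmcc" else -abs(m["Sen"] - m["Spe"])

    return max(sorted(set(scores)), key=quality)


# Toy, imbalanced example: 3 binding residues and 7 non-binding residues.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]

for mode in ("balanced", "maxmcc"):
    t = pick_threshold(labels, scores, mode)
    print(mode, t, metrics(*confusion(labels, scores, t)))
```

On imbalanced data the MCC-maximizing threshold tends to be stricter, trading sensitivity for specificity, which matches the pattern in the without-SOS rows of the two tables above.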
Figure 1. ROC curves of with-SOS and without-SOS predictions for ATP168 and ATP227 over five-fold cross-validation.
(a) ROC curves for ATP168; (b) ROC curves for ATP227.
Performance comparisons between SOS and ROS, SMOTE, and ADASYN for ATP168 and ATP227 over five-fold cross-validation under MaxMCC Evaluation.
| Dataset | Over-Sampling Method | Sen (%) | Spe (%) | Acc (%) | MCC | AUC |
| ATP168 | SOS | | | | | |
| ATP168 | ADASYN | 41.7 | 99.0 | 96.1 | 0.512 | 0.877 |
| ATP168 | SMOTE | 41.4 | 99.0 | 96.1 | 0.511 | 0.860 |
| ATP168 | ROS | 39.2 | 98.8 | 95.8 | 0.474 | 0.846 |
| ATP227 | SOS | 46.3 | | | | 0.893 |
| ATP227 | ADASYN | | 98.9 | 96.8 | 0.537 | |
| ATP227 | SMOTE | 44.7 | 99.0 | 96.8 | 0.526 | 0.880 |
| ATP227 | ROS | 42.9 | 99.1 | 96.9 | 0.522 | 0.876 |
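For readers unfamiliar with the baselines in the table above, the following is a minimal sketch of two of them: random over-sampling (ROS), which duplicates minority samples, and a SMOTE-style method, which interpolates between a minority sample and one of its nearest minority neighbours. It is not the paper's SOS algorithm, which this excerpt does not describe, and ADASYN is omitted. The function names, the parameter `k`, and the toy feature vectors are assumptions made for illustration.

```python
import random


def random_over_sample(minority, n_new, rng):
    """ROS: create n_new samples by copying randomly chosen minority samples."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def smote_like(minority, n_new, k, rng):
    """SMOTE-style synthesis.

    For each new sample, pick a minority sample, pick one of its k nearest
    minority neighbours (squared Euclidean distance), and return a random
    point on the line segment between the two.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (x for x in minority if x is not base),
            key=lambda x: sum((a - b) ** 2 for a, b in zip(base, x)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy minority class: four 2-D feature vectors (hypothetical values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(0)

duplicates = random_over_sample(minority, n_new=6, rng=rng)
synthetic = smote_like(minority, n_new=6, k=2, rng=rng)
print(len(duplicates), len(synthetic))
```

ROS adds no new information because every added sample already exists, whereas the interpolating variant produces new points inside the region spanned by the minority class. Both sketches ignore the majority class entirely.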
Performance comparisons between the proposed TargetSOS, TargetATP, TargetATPsite, and ATPint for ATP168 over five-fold cross-validation under Balanced Evaluation.
| Predictor | Sen (%) | Spe (%) | Acc (%) | MCC | AUC |
| TargetSOS | | | | | |
| TargetATP | 79.1 | 79.8 | 79.8 | 0.308 | 0.873 |
| TargetATPsite | 78.2 | 78.4 | 78.4 | 0.290 | 0.860 |
| ATPint | 74.4 | 75.8 | 75.1 | 0.249 | 0.823 |
Performance comparisons between the proposed TargetSOS and other popular predictors for the NUC5 dataset over five-fold cross-validation under MaxMCC Evaluation.
| Ligand Type | Predictor | Sen (%) | Spe (%) | Acc (%) | MCC | AUC |
| ATP | TargetSOS | | | | | 0.893 |
| ATP | TargetATP | 41.2 | 99.0 | 96.6 | 0.501 | |
| ATP | TargetATPsite | 44.5 | 98.9 | 96.6 | 0.520 | 0.881 |
| ATP | NsitePred | 44.4 | 98.2 | 96.0 | 0.460 | 0.861 |
| ATP | SVMPred | 36.1 | 98.8 | 96.2 | 0.433 | 0.854 |
| ADP | TargetSOS | | 99.1 | | | |
| ADP | NsitePred | 54.4 | 98.8 | 97.1 | 0.572 | 0.893 |
| ADP | SVMPred | 45.8 | | 97.3 | 0.555 | 0.885 |
| AMP | TargetSOS | | 98.8 | | | |
| AMP | NsitePred | 30.4 | 98.8 | 96.2 | 0.377 | 0.829 |
| AMP | SVMPred | 20.8 | | 96.6 | 0.360 | 0.820 |
| GDP | TargetSOS | | | | | |
| GDP | NsitePred | 64.6 | 99.1 | 97.6 | 0.675 | 0.910 |
| GDP | SVMPred | 62.3 | 98.9 | 97.7 | 0.655 | 0.905 |
| GTP | TargetSOS | | 99.5 | | | |
| GTP | NsitePred | 47.3 | 99.1 | 96.8 | 0.562 | 0.844 |
| GTP | SVMPred | 37.3 | | 97.0 | 0.551 | 0.836 |
* Data excerpted from [14].
Performance comparisons between the proposed TargetSOS and other popular predictors for the independent validation dataset of NUC5.
| Ligand Type | Predictor | Sen (%) | Spe (%) | Acc (%) | MCC | AUC |
| ATP | TargetSOS | | | | | |
| ATP | TargetATP | 48.9 | 98.9 | 96.9 | 0.542 | |
| ATP | TargetATPsite | 45.8 | 99.1 | 97.2 | 0.530 | 0.882 |
| ATP | NsitePred | 46.0 | 98.5 | 96.7 | 0.476 | 0.875 |
| ATP | SVMPred | 36.7 | 99.1 | 96.9 | 0.451 | 0.868 |
| ADP | TargetSOS | | 98.5 | 97.0 | | |
| ADP | NsitePred | 47.4 | 98.7 | 96.8 | 0.512 | 0.893 |
| ADP | SVMPred | 38.8 | | | 0.500 | 0.886 |
| AMP | TargetSOS | | 98.9 | 96.7 | | |
| AMP | NsitePred | 42.3 | 98.7 | | 0.501 | 0.876 |
| AMP | SVMPred | 33.5 | | 96.7 | 0.478 | 0.870 |
| GDP | TargetSOS | 49.1 | | 97.2 | 0.562 | 0.866 |
| GDP | NsitePred | | 98.5 | 97.0 | | |
| GDP | SVMPred | 51.1 | 98.8 | | 0.553 | 0.855 |
| GTP | TargetSOS | | 98.8 | | | 0.900 |
| GTP | NsitePred | 60.4 | 98.8 | 96.9 | 0.640 | |
| GTP | SVMPred | 48.5 | | 96.9 | 0.602 | 0.887 |
* Data excerpted from [14].