Xin Ma, Jing Guo, Xiao Sun.
Abstract
The prediction of RNA-binding proteins is one of the most challenging problems in computational biology. Although several studies have investigated this problem, prediction accuracy remains insufficient. In this study, a highly accurate method was developed to predict RNA-binding proteins from amino acid sequences using random forests with the minimum redundancy maximum relevance (mRMR) method, followed by incremental feature selection (IFS). We incorporated conjoint triad (CT) features and three novel features: binding propensity (BP), nonbinding propensity (NBP), and evolutionary information combined with physicochemical properties (EIPP). The results showed that these novel features play important roles in improving the performance of the predictor. Using the mRMR-IFS method, our predictor achieved its best performance (86.62% accuracy and a Matthews correlation coefficient of 0.737). The high prediction accuracy suggests that our method can be a useful approach for identifying RNA-binding proteins from sequence information.
Year: 2015 PMID: 26543860 PMCID: PMC4620426 DOI: 10.1155/2015/425810
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
The prediction performance of the RF model based on various features, evaluated by 10 cycles of 5-fold cross-validation on the MDset dataset.
| Feature | Accuracy ± SD | Sensitivity ± SD | Specificity ± SD | MCC ± SD |
|---|---|---|---|---|
| PSSM-400 | 0.7967 ± 0.0062 | 0.7003 ± 0.0093 | 0.8894 ± 0.0075 | 0.620 ± 0.016 |
| EIPP | 0.8311 ± 0.0105 | 0.7487 ± 0.0071 | 0.9107 ± 0.0129 | 0.662 ± 0.021 |
| CT | 0.7482 ± 0.0092 | 0.6591 ± 0.0067 | 0.8406 ± 0.0153 | 0.5096 ± 0.015 |
| EIPP + BP + NBP | 0.8428 ± 0.0038 | 0.7573 ± 0.0082 | 0.9367 ± 0.0043 | 0.704 ± 0.008 |
| CT + BP + NBP | 0.7661 ± 0.0197 | 0.7034 ± 0.0132 | 0.8587 ± 0.0114 | 0.568 ± 0.026 |
| EIPP + CT | 0.8317 ± 0.0139 | 0.7482 ± 0.0068 | 0.9202 ± 0.0127 | 0.671 ± 0.018 |
| EIPP + BP + NBP + CT | 0.8573 ± 0.0117 | 0.7764 ± 0.0143 | 0.9424 ± 0.0062 | 0.729 ± 0.020 |
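The "mean ± SD over 10 cycles of 5-fold cross-validation" protocol named in the caption can be sketched as follows. This is a minimal illustration, not the authors' code: `evaluate_fold` is a hypothetical callback standing in for "train a random forest on the training indices and score it on the held-out fold", and taking the SD across the 10 per-cycle means is an assumption, since the record does not say how the SD was computed.

```python
import random
import statistics


def repeated_kfold_scores(n_samples, evaluate_fold, n_repeats=10, n_folds=5, seed=0):
    """Return one score per repeat, each averaged over that repeat's folds.

    evaluate_fold(train_idx, test_idx) -> float is a hypothetical callback
    (e.g. fit a classifier on train_idx, return its accuracy on test_idx).
    """
    rng = random.Random(seed)
    per_repeat = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every cycle
        # Deal the shuffled indices into n_folds nearly equal folds.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        fold_scores = []
        for i, test_idx in enumerate(folds):
            train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
            fold_scores.append(evaluate_fold(train_idx, test_idx))
        per_repeat.append(statistics.mean(fold_scores))
    return per_repeat


def mean_and_sd(scores):
    """Summarise per-repeat scores as (mean, sample standard deviation)."""
    return statistics.mean(scores), statistics.stdev(scores)
```

Each row of the table would then correspond to one call per metric, with the feature set fixed inside the callback.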
Figure 1. The IFS curve showing MCC values against feature numbers. The maximum MCC value was 0.684 when the top 47 features were selected.
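The IFS curve in Figure 1 comes from scoring successively longer prefixes of a ranked feature list and keeping the prefix with the highest MCC. A minimal sketch of that loop is given below, assuming the ranking (for example from mRMR) is already available; `score_subset` is a hypothetical callback, not a function from the paper, that would return the cross-validated MCC of a model trained on the given features.

```python
def incremental_feature_selection(ranked_features, score_subset):
    """Score the top-1, top-2, ..., top-N prefixes of a ranked feature list.

    score_subset(features) -> float is a hypothetical callback, e.g. the
    cross-validated MCC of a classifier restricted to those features.
    Returns (best_k, best_score, curve); curve[k - 1] is the top-k score.
    """
    curve = [
        score_subset(ranked_features[:k])
        for k in range(1, len(ranked_features) + 1)
    ]
    # max() keeps the first maximum, so ties resolve to the smallest subset.
    best_k = max(range(1, len(curve) + 1), key=lambda k: curve[k - 1])
    return best_k, curve[best_k - 1], curve
```

Plotting `curve` against `k` gives a curve of the kind shown in the figure, and `best_k` marks its peak.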
Optimal 47 features for prediction of RNA-binding proteins.
| Rank | Feature |
|---|---|
| 1 | EIPP of ASP in protein sequence for the pKa values of amino group |
| 2 | EIPP of GLU in protein sequence for the Balaban index |
| 3 | BP(2) |
| 4 | EIPP of TYR in protein sequence for the pKa values of amino group |
| 5 | CT of class a, class b, and class e |
| 6 | CT of class d, class b, and class e |
| 7 | EIPP of HIS in protein sequence for the pKa values of amino group |
| 8 | EIPP of LYS in protein sequence for the pKa values of carboxyl group |
| 9 | CT of class b, class d, and class e |
| 10 | CT of class d, class c, and class e |
| 11 | EIPP of MET in protein sequence for the molecular mass |
| 12 | CT of class b, class e, and class a |
| 13 | EIPP of ARG in protein sequence for the pKa values of amino group |
| 14 | NBP(2) |
| 15 | CT of class c, class e, and class d |
| 16 | BP(1) |
| 17 | EIPP of TRP in protein sequence for the pKa values of amino group |
| 18 | CT of class d, class d, and class e |
| 19 | EIPP of LYS in protein sequence for the Balaban index |
| 20 | NBP(1) |
| 21 | CT of class c, class a, and class d |
| 22 | CT of class b, class e, and class d |
| 23 | CT of class e, class d, and class e |
| 24 | EIPP of HIS in protein sequence for the pKa values of carboxyl group |
| 25 | CT of class d, class c, and class f |
| 26 | CT of class e, class f, and class d |
| 27 | CT of class e, class b, and class d |
| 28 | CT of class d, class e, and class c |
| 29 | EIPP of GLY in protein sequence for the pKa values of carboxyl group |
| 30 | EIPP of THR in protein sequence for the molecular mass |
| 31 | CT of class c, class b, and class e |
| 32 | CT of class c, class e, and class a |
| 33 | EIPP of GLN in protein sequence for Wiener index |
| 34 | EIPP of SER in protein sequence for Wiener index |
| 35 | EIPP of ASN in protein sequence for the molecular mass |
| 36 | CT of class b, class a, and class c |
| 37 | CT of class e, class d, and class f |
| 38 | CT of class e, class b, and class a |
| 39 | EIPP of TRP in protein sequence for the pKa values of carboxyl group |
| 40 | CT of class a, class e, and class c |
| 41 | EIPP of ARG in protein sequence for the lowest free energy |
| 42 | CT of class e, class c, and class d |
| 43 | EIPP of LYS in protein sequence for the molecular mass |
| 44 | CT of class e, class e, and class d |
| 45 | EIPP of TYR in protein sequence for Wiener index |
| 46 | CT of class e, class c, and class b |
| 47 | CT of class f, class c, and class d |
Figure 2. (a) Feature distribution for the 47 optimal features. (b) The selection proportion of each type of feature.
Figure 3. (a) Distribution of the physicochemical properties used to construct the 19 EIPP features selected in the optimal feature set. (b) Distribution of the amino acid types used to construct the same 19 EIPP features.
Figure 4. Distribution of the amino acid classes used to construct the 24 CT features selected in the optimal feature set.
Comparison of the prediction results of our method and existing web servers on the Testset (ACC, SE, and SP are reported as fractions).
| Method | ACC | SE | SP | MCC |
|---|---|---|---|---|
| Our method | 0.7674 | 0.7222 | 0.8125 | 0.537 |
| SVMprot | 0.5764 | 0.7639 | 0.3889 | 0.165 |
| RNApred | 0.6111 | 0.6389 | 0.5833 | 0.223 |
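As a consistency check on the table above, MCC can be recomputed from sensitivity and specificity once the class ratio is fixed. The sketch below assumes a balanced Testset (equal numbers of binding and nonbinding proteins). The record does not state this; it is inferred only from each reported ACC being the average of SE and SP. The function name and the `pos_fraction` parameter are illustrative.

```python
import math


def mcc_from_rates(sensitivity, specificity, pos_fraction=0.5):
    """MCC from sensitivity and specificity for a given fraction of positives.

    pos_fraction=0.5 (balanced classes) is an assumption, not a value
    stated in the table.
    """
    tp = sensitivity * pos_fraction
    fn = (1 - sensitivity) * pos_fraction
    tn = specificity * (1 - pos_fraction)
    fp = (1 - specificity) * (1 - pos_fraction)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# (SE, SP) pairs copied from the table rows.
rows = {
    "Our method": (0.7222, 0.8125),
    "SVMprot": (0.7639, 0.3889),
    "RNApred": (0.6389, 0.5833),
}
for name, (se, sp) in rows.items():
    print(f"{name}: MCC = {mcc_from_rates(se, sp):.3f}")
```

Under the balanced-class assumption, the three recomputed values round to the reported 0.537, 0.165, and 0.223.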