Ankita Agarwal, Kunal Singh, Shri Kant, Ranjit Prasad Bahadur.
Abstract
RNA-protein interactions play vital roles in driving the cellular machinery. Despite their involvement in several biological processes, the underlying molecular mechanism of RNA-protein interactions remains elusive, possibly because of the experimental difficulties in solving co-crystallized RNA-protein complexes. The inherent flexibility of RNA molecules, which lets them adopt different conformations, makes them functionally diverse, and their interactions with proteins have implications in RNA disease biology. Studying binding interfaces can therefore provide mechanistic insight into molecular function and into the aberrations caused by altered interactions. Moreover, high-throughput sequencing technologies have generated far more sequence data than there is structural data available for RNA-protein complexes. In such a scenario, efficient computational algorithms are required to identify the protein-binding interfaces of RNA in the absence of known structures. We have investigated several machine learning classifiers and various features derived from nucleotide sequences to identify protein-binding nucleotides in RNA. We achieve the best performance with random forest models based on nucleotide-triplet and nucleotide-quartet features: an overall accuracy of 84.8%, sensitivity of 83.2%, specificity of 86.1%, MCC of 0.70 and AUC of 0.93. We have further implemented the developed models in a user-friendly webserver, "Nucpred", which is freely accessible at http://www.csb.iitkgp.ac.in/applications/Nucpred/index.
Keywords: Machine learning; Protein-binding nucleotides; RNA-protein interactions; Random forest classifier; Stratified cross validation
Year: 2022 PMID: 35832617 PMCID: PMC9249596 DOI: 10.1016/j.csbj.2022.06.036
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Non-redundant training dataset of protein-binding RNAs.
| RNA-type | No. of RNA chains | PDB IDs |
|---|---|---|
| ssRNA | 70 | 1AV6_B, 1C9S_W, 1CVJ_M, 1G2E_B, 1JBS_C, 1JID_B, 1K8W_B, 1KNZ_W, 1KQ2_R, 1LNG_B, 1M5O_E, 1M8V_O, 1M8W_C, 1M8W_E, 1WPU_C, 1WSU_E, 1ZBH_E, 1ZH5_D, 2A8V_E, 2ANR_B, 2ASB_B, 2B3J_E, 2BX2_R, 2DB3_E, 2G4B_B, 2GIC_R, 2IX1_B, 2J0S_E, 2JEA_C, 2JLU_C, 2PY9_E, 2Q66_X, 2R7R_X, 2VNU_B, 2XGJ_C, 2XNR_C, 2XS2_B, 2XZO_D, 3AEV_C, 3BX2_C, 3D2S_E, 3I5X_B, 3IEV_D, 3K5Q_B, 3MDG_C, 3NMR_B, 3O8C_C, 3PF4_R, 3QJJ_Q, 3R2C_R, 3RC8_E, 3T5N_C, 4H5P_E, 4J1G_E, 4J7M_B, 4M59_C, 4M59_D, 4MDX_C, 4N2Q_B, 5AOR_C, 5DET_Q, 5EIM_C, 5ELH_R, 5ELK_R, 5ELR_B, 5ELS_I, 5EX7_B, 5GXH_B, 5I4A_D, 5LTA_E |
| dsRNA | 77 | 1DI2_D, 1HQ1_B, 1MSW_R, 1N35_B, 1N35_C, 1OOA_C, 1R3E_C, 1R9F_B, 1R9F_C, 1SI3_B, 1WNE_B, 1WNE_C, 1YVP_E, 1YVP_F, 1YVP_H, 1ZBI_C, 2AZ0_C, 2AZ0_D, 2BGG_P, 2BGG_Q, 2EZ6_C, 2EZ6_D, 2F8S_C, 2GJW_E, 2GJW_F, 2GJW_H, 2GXB_E, 2GXB_F, 2OZB_C, 2PJP_B, 2QUX_C, 2R8S_R, 2XD0_G, 2Y8W_B, 2YKG_C, 2YKG_D, 2ZI0_C, 2ZKO_C, 2ZKO_D, 3A6P_D, 3A6P_E, 3BSN_P, 3BSN_T, 3BT7_C, 3DH3_F, 3EQT_C, 3EQT_D, 3FTE_C, 3FTE_D, 3IAB_R, 3KS8_E, 3KS8_F, 3MOJ_A, 3O3I_A, 3OIJ_C, 3RW6_H, 3SNP_C, 3ZC0_M, 4ATO_G, 4ERD_C, 4ERD_D, 4FVU_B, 4FVU_C, 4IG8_B, 4IG8_C, 4ILL_C, 4ILL_R, 4L8H_R, 4ZT0_B, 5AOX_C, 5ED1_B, 5ED1_C, 5F5F_B, 5F5H_C, 5ID6_G, 5TF6_B, 5WTK_B |
| tRNA | 33 | 1ASY_R, 1B23_R, 1C0A_B, 1FFY_T, 1GAX_C, 1H3E_B, 1H4S_T, 1J1U_B, 1N78_C, 1QF6_B, 1QTQ_B, 1SER_T, 1U0B_A, 1VFG_D, 2AZX_C, 2BTE_B, 2CSX_C, 2DLC_Y, 2DRB_B, 2DU3_D, 2FK6_R, 2FMT_C, 2ZM5_C, 2ZZM_B, 3ADB_C, 3AMT_B, 3EPH_E, 3HL2_E, 3VJR_B, 4YCP_B, 4YVJ_C, 5HR7_D, 5T8Y_X |
| Ribosomal RNA | 14 | 1DFU_M, 1DFU_N, 1FEU_B, 1FEU_C, 1G1X_D, 1G1X_E, 1I6U_C, 1MJI_C, 1MMS_C, 1MZP_B, 1S03_A, 1SDS_D, 2HW8_B, 5WTY_C |
Performance Evaluation Metrics.
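| No. | Metric | Formula |
|---|---|---|
| 1 | Accuracy (ACC) | $\frac{TP + TN}{TP + TN + FP + FN}$ |
| 2 | Specificity (SPE) | $\frac{TN}{TN + FP}$ |
| 3 | Sensitivity (SEN) | $\frac{TP}{TP + FN}$ |
| 4 | Precision (PPV) | $\frac{TP}{TP + FP}$ |
| 5 | F-measure (F-score) | $\frac{2 \cdot PPV \cdot SEN}{PPV + SEN}$ |
| 6 | Matthews correlation coefficient (MCC) | $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ |

TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. The formula cells did not survive extraction; the expressions shown are the standard definitions of these metrics.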
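The six metrics listed above can be computed from confusion-matrix counts. The following is a minimal sketch that assumes the standard definitions of each metric; the function name `classification_metrics` and the dictionary keys are illustrative and not taken from the source. The usage example plugs in the counts reported in the Fig. 9 legend (24 TP, 36 TN, 1 FP, 10 FN).

```python
import math


def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Return ACC, SPE, SEN, PPV, F-score and MCC from confusion-matrix counts.

    Assumes the standard textbook definitions of each metric.
    """
    sen = _ratio(tp, tp + fn)            # sensitivity (recall)
    spe = _ratio(tn, tn + fp)            # specificity
    ppv = _ratio(tp, tp + fp)            # precision
    acc = _ratio(tp + tn, tp + tn + fp + fn)
    f_score = _ratio(2 * ppv * sen, ppv + sen)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, mcc_denominator)
    return {"ACC": acc, "SPE": spe, "SEN": sen,
            "PPV": ppv, "F": f_score, "MCC": mcc}


# Counts taken from the Fig. 9 legend: 24 TP, 36 TN, 1 FP, 10 FN (71 nt in total).
m = classification_metrics(tp=24, tn=36, fp=1, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Returning 0.0 for an undefined ratio is a design choice of this sketch; some tools report such cases as undefined instead.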
Fig. 1. Pipeline for development and prediction of classification model.
Fig. 2. Pairwise residue-nucleotide interface composition in training dataset.
Fig. 3. Distribution of nucleotide stretches at the interface in training dataset.
Fig. 4. Nucleotide compositions (A) NC-singlets, (B) NC-doublets and (C) NC-triplets at interface and non-interface regions in training dataset.
Fig. 5. ROC curves at variable window size (5 nt to 30 nt) for (A) BE-feature, (B) NC-singlet, (C) NC-doublet, (D) NC-triplet, (E) NC-quartet and (F) Ensemble-feature based RF models.
10-fold CV results for different RNA features at optimized WS using RF.
| Features | WS | ACC | SPE | SEN | F-score | PPV | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| Global (SL + NC) | – | 0.68 | 0.77 | 0.57 | 0.61 | 0.67 | 0.35 | 0.73 |
| Local (pKa) | 29 | 0.68 | 0.85 | 0.46 | 0.55 | 0.71 | 0.34 | 0.72 |
| Local (mass) | 29 | 0.67 | 0.85 | 0.45 | 0.55 | 0.70 | 0.33 | 0.72 |
| Local (pKa + mass) | 27 | 0.67 | 0.85 | 0.45 | 0.55 | 0.71 | 0.34 | 0.72 |
| Binary encoding (BE) | 27 | 0.67 | 0.85 | 0.45 | 0.54 | 0.70 | 0.34 | 0.72 |
| NC-singlet | 29 | 0.70 | 0.73 | 0.66 | 0.66 | 0.67 | 0.40 | 0.75 |
| NC-doublet | 23 | 0.84 | 0.86 | 0.82 | 0.82 | 0.82 | 0.69 | 0.92 |
| **NC-triplet** | 23 | 0.85 | 0.86 | 0.84 | 0.83 | 0.83 | | |
| **NC-quartet** | 23 | 0.85 | 0.86 | 0.84 | 0.83 | 0.83 | | |
| Global + Local + BE | 27 | 0.71 | 0.84 | 0.42 | 0.56 | 0.74 | 0.42 | 0.77 |
| Global + Local (SL + NC + pKa + mass) | 23 | 0.72 | 0.84 | 0.58 | 0.65 | 0.75 | 0.44 | 0.78 |
| Ensemble features | 27 | 0.81 | 0.87 | 0.75 | 0.78 | 0.83 | 0.63 | 0.88 |
Best performing features are marked in bold considering AUC and MCC.
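The results above come from 10-fold cross-validation, and the keywords name stratified cross-validation. As an illustration of the idea, the sketch below splits sample indices into folds that each keep roughly the same ratio of binding to non-binding labels. It is a plain-Python approximation and not the authors' pipeline; the function name `stratified_kfold`, the seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs with class ratios preserved.

    Indices of each class are shuffled, then dealt round-robin into k folds,
    so every fold receives a near-equal share of each class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(index
                       for j, fold in enumerate(folds) if j != i
                       for index in fold)
        yield train, test


# Toy labels (hypothetical): 30 binding (1) and 70 non-binding (0) nucleotides.
labels = [1] * 30 + [0] * 70
splits = list(stratified_kfold(labels, k=10))
for train, test in splits[:2]:
    positives = sum(labels[i] for i in test)
    print(len(train), len(test), positives)
```

In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used; the hand-written version here only makes the splitting logic explicit.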
10-fold CV results at varying WS for nucleotide triplet composition using RF.
| WS | Accuracy | Specificity | Sensitivity | F-score | PPV | MCC | AUC |
|---|---|---|---|---|---|---|---|
| 5 | 56.35 | 59.11 | 52.92 | 51.92 | 50.96 | 11.99 | 0.58 |
| 7 | 68.78 | 73.03 | 63.50 | 64.44 | 65.40 | 36.65 | 0.72 |
| 9 | 80.13 | 84.42 | 74.78 | 77.02 | 79.39 | 59.62 | 0.85 |
| 11 | 83.06 | 86.08 | 79.30 | 80.66 | 82.06 | 65.63 | 0.89 |
| 13 | 84.12 | 87.01 | 80.51 | 81.86 | 83.27 | 67.77 | 0.91 |
| 15 | 84.45 | 86.48 | 81.92 | 82.43 | 82.95 | 68.48 | 0.92 |
| 17 | 84.17 | 85.75 | 82.21 | 82.22 | 82.24 | 67.96 | 0.92 |
| 19 | 84.24 | 86.15 | 81.87 | 82.23 | 82.59 | 68.08 | 0.92 |
| 21 | 84.34 | 85.81 | 82.50 | 82.43 | 82.36 | 68.30 | 0.93 |
| **23** | 84.72 | 85.61 | 84.11 | 83.06 | 82.04 | | |
| 25 | 84.82 | 85.71 | 83.70 | 83.08 | 82.47 | 69.32 | 0.93 |
| 27 | 83.99 | 84.82 | 82.95 | 82.19 | 81.43 | 67.65 | 0.928 |
| 29 | 84.13 | 84.68 | 83.45 | 82.41 | 81.39 | 67.98 | 0.926 |
| 31 | 84.43 | 85.08 | 83.62 | 82.71 | 81.82 | 68.56 | 0.926 |
WS with optimum performance are marked in bold considering AUC and MCC.
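To make the window-size (WS) dependence concrete, the sketch below shows one way a nucleotide-triplet composition feature vector could be built for every nucleotide in a sequence. Several details are assumptions and are not stated in the source: the window is centered on the target nucleotide, sequence ends are padded with a placeholder `N`, triplets overlap, and counts are normalized to frequencies. The names `triplet_composition` and `window_features` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"
# All 4^3 = 64 possible nucleotide triplets, in a fixed order.
TRIPLETS = ["".join(p) for p in product(ALPHABET, repeat=3)]


def triplet_composition(window):
    """Return the 64 overlapping-triplet frequencies of a window.

    Triplets containing a non-ACGU character (e.g. padding) are skipped.
    """
    counts = dict.fromkeys(TRIPLETS, 0)
    total = 0
    for i in range(len(window) - 2):
        triplet = window[i:i + 3]
        if triplet in counts:
            counts[triplet] += 1
            total += 1
    return [counts[t] / total if total else 0.0 for t in TRIPLETS]


def window_features(sequence, ws=23, pad="N"):
    """Return one 64-dimensional feature vector per nucleotide.

    Assumption: the window of odd size `ws` is centered on each nucleotide,
    and the sequence is padded with `pad` at both ends.
    """
    sequence = sequence.upper().replace("T", "U")
    half = ws // 2
    padded = pad * half + sequence + pad * half
    return [triplet_composition(padded[i:i + ws])
            for i in range(len(sequence))]


feats = window_features("ACGUACGUAC", ws=5)
print(len(feats), len(feats[0]))
```

A window that is too small yields very few triplets per position, which is one plausible reading of why the metrics in the table are poor at WS 5 and level off at larger sizes.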
10-fold CV results at varying WS for nucleotide quartet composition using RF.
| WS | Accuracy | Specificity | Sensitivity | F-score | PPV | MCC | AUC |
|---|---|---|---|---|---|---|---|
| 5 | 56.83 | 59.84 | 53.09 | 52.28 | 51.49 | 12.90 | 0.58 |
| 7 | 69.40 | 73.69 | 64.12 | 65.14 | 66.18 | 37.95 | 0.74 |
| 9 | 81.23 | 85.25 | 76.23 | 78.35 | 78.35 | 61.88 | 0.87 |
| 11 | 83.34 | 85.95 | 80.09 | 81.07 | 82.07 | 66.21 | 0.91 |
| 13 | 84.21 | 86.31 | 81.58 | 82.15 | 82.72 | 68.00 | 0.92 |
| 15 | 84.80 | 86.78 | 82.33 | 82.83 | 83.33 | 69.20 | 0.92 |
| 17 | 84.72 | 86.55 | 82.46 | 82.78 | 83.11 | 69.06 | 0.93 |
| 19 | 85.00 | 86.38 | 83.28 | 83.18 | 83.08 | 69.65 | 0.93 |
| 21 | 84.91 | 86.11 | 83.41 | 83.12 | 82.83 | 69.48 | 0.93 |
| **23** | 85.04 | 86.11 | 83.70 | 83.29 | 82.87 | | |
| 25 | 85.02 | 86.08 | 83.70 | 83.27 | 82.84 | 69.71 | 0.93 |
| 27 | 84.41 | 85.15 | 83.49 | 82.67 | 81.86 | 68.52 | 0.93 |
| 29 | 84.48 | 84.92 | 83.95 | 82.82 | 81.71 | 68.70 | 0.93 |
| 31 | 84.10 | 84.98 | 82.99 | 82.29 | 81.61 | 67.87 | 0.93 |
WS with optimum performance are marked in bold considering AUC and MCC.
Performance measures of NC-triplet RF model for different RNA-types in the dataset.
| RNA-type | Accuracy | Specificity | Sensitivity | F-score | PPV | MCC |
|---|---|---|---|---|---|---|
| tRNA | 0.975 | 0.978 | 0.970 | 0.960 | 0.960 | 0.940 |
| ssRNA | 0.930 | 0.910 | 0.950 | 0.940 | 0.930 | 0.860 |
| dsRNA | 0.950 | 0.930 | 0.960 | 0.950 | 0.930 | 0.890 |
| rRNA | 0.980 | 0.980 | 0.980 | 0.980 | 0.980 | 0.960 |
10-fold CV comparisons for 10 ML algorithms at optimized WS using NC-triplet feature.
| Sl.no. | Algorithm | Optimum WS | ACC | SPE | SEN | F-score | PPV | MCC | AUC |
|---|---|---|---|---|---|---|---|---|---|
| 1. | GNB | 27 | 0.63 | 0.82 | 0.40 | 0.49 | 0.64 | 0.24 | 0.69 |
| 2. | MNB | 23 | 0.62 | 0.70 | 0.52 | 0.55 | 0.58 | 0.22 | 0.68 |
| 3. | | 17 | 0.84 | 0.85 | 0.83 | 0.82 | 0.82 | 0.68 | 0.91 |
| 4. | | 25 | 0.84 | 0.85 | 0.81 | 0.81 | 0.81 | 0.67 | 0.92 |
| 5. | Linear SVM | 29 | 0.72 | 0.87 | 0.53 | 0.62 | 0.76 | 0.42 | 0.80 |
| 6. | | 25 | 0.84 | 0.85 | 0.82 | 0.81 | 0.81 | 0.67 | 0.91 |
| 7. | ADB | 29 | 0.70 | 0.77 | 0.61 | 0.64 | 0.68 | 0.40 | 0.79 |
| 8. | GBT | 27 | 0.79 | 0.85 | 0.71 | 0.75 | 0.80 | 0.59 | 0.88 |
| 9. | | 27 | 0.84 | 0.85 | 0.83 | 0.82 | 0.82 | 0.68 | 0.93 |
| 10. | **RF** | 23 | 0.85 | 0.87 | 0.84 | 0.83 | 0.83 | 0.70 | 0.93 |
| 11. | Voting Ensemble | 25 | 0.84 | 0.86 | 0.82 | 0.82 | 0.82 | 0.67 | 0.92 |
Best performing ML classifiers are marked in bold. Voting ensemble is trained using five best performing classifiers marked in bold.
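The source does not say whether the voting ensemble uses hard or soft voting. As an illustration only, the sketch below assumes soft voting: the predicted binding probabilities of several classifiers are averaged per nucleotide and then thresholded. The function name `soft_vote`, the 0.5 threshold, and the toy probabilities are assumptions.

```python
def soft_vote(probabilities_per_model, threshold=0.5):
    """Average per-sample binding probabilities across models.

    `probabilities_per_model` is a list with one list of probabilities per
    classifier; all inner lists must have the same length.
    Returns (predicted_labels, averaged_probabilities).
    """
    n_models = len(probabilities_per_model)
    n_samples = len(probabilities_per_model[0])
    if any(len(p) != n_samples for p in probabilities_per_model):
        raise ValueError("all models must score the same number of samples")

    averaged = [
        sum(model[i] for model in probabilities_per_model) / n_models
        for i in range(n_samples)
    ]
    predicted = [1 if p >= threshold else 0 for p in averaged]
    return predicted, averaged


# Toy scores (hypothetical) from three classifiers for three nucleotides.
toy_scores = [
    [0.90, 0.20, 0.60],
    [0.80, 0.40, 0.40],
    [0.70, 0.10, 0.45],
]
vote_labels, vote_probs = soft_vote(toy_scores)
print(vote_labels)
```

With hard voting, each classifier would instead cast a 0/1 vote and the majority label would be taken; which variant was used here is not recoverable from this record.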
Fig. 6. ROC curves obtained for NC-triplet model at optimum window size of 23 nt for 10 different ML classifiers.
Fig. 7. Boxplot showing the CV performance of individual and voting ensemble classifiers based on sensitivity.
Fig. 8. (A) ROC curve and (B) Confusion matrix obtained with 10-fold CV for NC-triplet RF model at optimum window size of 23 nt.
10-fold CV measures of NC-triplet RF model for different validation and test datasets.
| Dataset | ACC | SPE | SEN | F-score | PPV | MCC | AUC |
|---|---|---|---|---|---|---|---|
| PB-RNA194 | 0.85 | 0.87 | 0.84 | 0.83 | 0.83 | 0.70 | 0.93 |
| RNA-208 | 0.83 | 0.89 | 0.62 | 0.62 | 0.63 | 0.52 | 0.88 |
| RNA-208 (ROS) | 0.83 | 0.86 | 0.71 | 0.64 | 0.58 | 0.53 | 0.87 |
| RNA-150 | 0.91 | 0.95 | 0.80 | 0.81 | 0.84 | 0.75 | 0.95 |
| RNA-344 | 0.87 | 0.90 | 0.81 | 0.81 | 0.82 | 0.71 | 0.94 |
| RNA-30 | 0.83 | 0.89 | 0.70 | 0.68 | 0.69 | 0.62 | 0.89 |
Fig. 9. Three-dimensional structure of tRNA (4JXZ_B, 71 nt) in complex with glutaminyl-tRNA synthetase (4JXZ_A). (A) The 34 protein-binding nucleotides, calculated based on SASA from the PDB structure, are shown as red spheres. The remaining non-binding nucleotides are shown in orange cartoon. (B) The 24 protein-binding (TP) and the 36 non-binding (TN) nucleotides, along with one false positive and 10 false negative nucleotides predicted by the developed classifier, are represented as red, cyan, yellow and purple spheres, respectively. Protein is represented in green ribbons in both structures.