Thanh-Hoang Nguyen-Vo, Quang H Nguyen, Trang T T Do, Thien-Ngan Nguyen, Susanto Rahardja, Binh P Nguyen.
Abstract
BACKGROUND: Pseudouridine modification is the most common of the various RNA modifications that occur in both prokaryotes and eukaryotes. It has been shown to occur in multiple types of RNA, including rRNA, mRNA, tRNA, and nuclear/nucleolar RNA. Hence, a holistic understanding of pseudouridine modification can contribute to drug discovery and the development of gene therapies. Although some laboratory techniques identify pseudouridine sites with moderately good results, they are costly and require skilled laboratory work. We propose iPseU-NCP, an efficient computational framework that predicts pseudouridine sites using the Random Forest (RF) algorithm combined with nucleotide chemical properties (NCP) generated from RNA sequences. The benchmark dataset collected from Chen et al. (2016) was used to develop iPseU-NCP and to compare its performance fairly with that of other methods.
Keywords: Identification; NCP; Pseudouridine site; RNA; Random forest; Uridine
Year: 2019 PMID: 31888464 PMCID: PMC6936030 DOI: 10.1186/s12864-019-6357-y
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Sequence characteristics of ψ sites across the three species: (a) H. sapiens, (b) S. cerevisiae, and (c) M. musculus. The sequence logos were created with Two Sample Logo using a t-test (p < 0.05)
Fig. 2 The processing steps in the proposed framework
Comparative analysis on our RF model using different encoding schemes under 5-fold cross-validation on different development datasets
| Dataset | Encoding Scheme | ACC (%) | SN (%) | SP (%) | MCC |
|---|---|---|---|---|---|
| PseKNC | 59.39 | 69.49 | 49.29 | 0.19 | |
| CKSNAP | 60.00 | 36.16 | 0.23 | ||
| NCP | 58.79 | ||||
| PseKNC | 58.76 | 51.91 | 0.18 | ||
| CKSNAP | 60.03 | 56.37 | 63.69 | 0.20 | |
| NCP | 62.10 | ||||
| PseKNC | 56.57 | 44.49 | 68.64 | 0.14 | |
| CKSNAP | 57.52 | 52.54 | 62.50 | 0.15 | |
| NCP |
Values which are significantly higher than the others are in bold
Fig. 3 Feature importances of the three predictive models for (a) H. sapiens, (b) S. cerevisiae, and (c) M. musculus
Fig. 4 Heatmaps of the 5-fold cross-validation accuracy for different combinations of max_depth and max_features on the three development datasets: (a) H_990, (b) S_628, and (c) M_944
Comparative analysis between results of the proposed method and other studies using 5-fold cross-validation
| Dataset | Model | ACC (%) | SN (%) | SP (%) | MCC | Method |
|---|---|---|---|---|---|---|
| iRNA-PseU | 60.40 | 61.01 | 59.80 | 0.21 | Chen et al., 2016 | |
| PseUI | 64.24 | 64.85 | 63.64 | 0.28 | He et al., 2018 | |
| iPseU-CNN | Tahir et al., 2019 | |||||
| iPseU-NCP | 62.92 | 58.79 | 65.05 | 0.24 | ||
| iRNA-PseU | 64.49 | 64.65 | 64.33 | 0.29 | Chen et al., 2016 | |
| PseUI | 65.13 | 62.74 | 67.52 | 0.30 | He et al., 2018 | |
| iPseU-CNN | 68.15 | 66.36 | 0.37 | Tahir et al., 2019 | ||
| iPseU-NCP | 62.10 | |||||
| iRNA-PseU | 69.07 | 73.31 | 64.83 | 0.38 | Chen et al., 2016 | |
| PseUI | 70.44 | 70.34 | 0.41 | He et al., 2018 | ||
| iPseU-CNN | 71.81 | 74.79 | 69.11 | 0.44 | Tahir et al., 2019 | |
| iPseU-NCP | 67.37 |
Values which are significantly higher than the others are in bold. Data excerpted from [16]
Comparative analysis between results of the proposed method and other studies on the independent test sets
| Dataset | Model | ACC (%) | SN (%) | SP (%) | MCC | Method |
|---|---|---|---|---|---|---|
| iRNA-PseU | 61.50 | 58.00 | 65.00 | 0.23 | Chen et al., 2016 | |
| PseUI | 65.50 | 63.00 | 68.00 | 0.31 | He et al., 2018 | |
| iPseU-CNN | 69.00 | 60.81 | 0.40 | Tahir et al., 2019 | ||
| iPseU-NCP | 70.00 | |||||
| iRNA-PseU | 60.00 | 63.00 | 57.00 | 0.20 | Chen et al., 2016 | |
| PseUI | 68.50 | 65.00 | 72.00 | 0.37 | He et al., 2018 | |
| iPseU-CNN | 73.50 | 68.76 | 0.47 | Tahir et al., 2019 | ||
| iPseU-NCP | 77.00 |
Values which are significantly higher than the others are in bold. Data excerpted from [16]
Fig. 5 A snapshot of the iPseU-NCP web-server
Data distribution of the training sets and the independent test sets
| Dataset | Positive samples | Negative samples | Total samples | Species | Group |
|---|---|---|---|---|---|
| S_628 | 314 | 314 | 628 | S. cerevisiae | Training (Development) |
| M_944 | 472 | 472 | 944 | M. musculus | Training (Development) |
| H_990 | 495 | 495 | 990 | H. sapiens | Training (Development) |
| S_200 | 100 | 100 | 200 | S. cerevisiae | Independent Test |
| H_200 | 100 | 100 | 200 | H. sapiens | Independent Test |
NCP-encoding scheme
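| Chemical property | Class | Code | Nucleotides |
|---|---|---|---|
| Cyclic Structure | Purine | 1 | A, G |
| Cyclic Structure | Pyrimidine | 0 | C, U |
| Functional Group | Amino | 1 | A, C |
| Functional Group | Keto | 0 | G, U |
| Hydrogen Bond | Weak | 1 | A, U |
| Hydrogen Bond | Strong | 0 | C, G |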
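Read row by row, the scheme above assigns each nucleotide a three-bit code in the order (cyclic structure, functional group, hydrogen bond): A = (1, 1, 1), C = (0, 1, 0), G = (1, 0, 0), U = (0, 0, 1). The sketch below shows one way a sequence could be turned into a flat feature vector with these codes. The names `NCP_CODES` and `encode_ncp` are hypothetical, and treating T as U is an assumption of this example rather than something stated in the source.

```python
# Three-bit NCP code per nucleotide, in the order
# (cyclic structure, functional group, hydrogen bond), taken from the table above.
NCP_CODES = {
    "A": (1, 1, 1),  # purine, amino, weak
    "C": (0, 1, 0),  # pyrimidine, amino, strong
    "G": (1, 0, 0),  # purine, keto, strong
    "U": (0, 0, 1),  # pyrimidine, keto, weak
}


def encode_ncp(sequence):
    """Encode an RNA sequence as a flat list of 3 * len(sequence) binary features."""
    # Assumption of this sketch: DNA-style input is accepted by mapping T to U.
    normalized = sequence.upper().replace("T", "U")
    features = []
    for position, base in enumerate(normalized):
        if base not in NCP_CODES:
            raise ValueError(f"Unsupported nucleotide {base!r} at position {position}")
        features.extend(NCP_CODES[base])
    return features


example = encode_ncp("GAUC")
print(example)  # [1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0]
```

A fixed-length sequence window encoded this way yields a fixed-length binary vector, which is the kind of input a Random Forest classifier can consume directly.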