Jingjing He, Ting Fang, Zizheng Zhang, Bei Huang, Xiaolei Zhu, Yi Xiong.
Abstract
BACKGROUND: Pseudouridylation is the most prevalent type of posttranscriptional modification in various stable RNAs of all organisms, and it significantly affects many cellular processes that are regulated by RNA. Thus, accurate identification of pseudouridine (Ψ) sites in RNA will be of great benefit for understanding these cellular processes. Because currently available experimental methods are inefficient and costly, it is highly desirable to develop computational methods for accurately and efficiently detecting Ψ sites in RNA sequences. However, the predictive accuracy of existing computational methods is not satisfactory and still needs improvement.
Keywords: Nucleotide composition; Position specific nucleotide propensity; Pseudouridine site
Year: 2018 PMID: 30157750 PMCID: PMC6114832 DOI: 10.1186/s12859-018-2321-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Information on the training datasets and independent testing datasets
| Species | Training/testing dataset^a | RNA sequence length (bp) | Number of positive samples | Number of negative samples |
|---|---|---|---|---|
| H. sapiens | H_990 | 21 | 495 | 495 |
| | H_200 | 21 | 100 | 100 |
| S. cerevisiae | S_628 | 31 | 314 | 314 |
| | S_200 | 31 | 100 | 100 |
| M. musculus | M_944 | 21 | 472 | 472 |
| | – | – | – | – |
^a H_990, S_628, and M_944 are the training datasets for H. sapiens, S. cerevisiae, and M. musculus, respectively; H_200 and S_200 are the independent testing datasets for H. sapiens and S. cerevisiae, respectively
Three types of physicochemical properties of dinucleotides in RNA
| Dinucleotide | Free energy | Hydrophilicity | Stacking energy |
|---|---|---|---|
| GG | −3.260 | 0.170 | −11.100 |
| GA | −2.350 | 0.100 | −14.200 |
| GC | −3.420 | 0.260 | −16.900 |
| GU | −2.240 | 0.270 | −13.800 |
| AG | −2.080 | 0.080 | −14.000 |
| AA | −0.930 | 0.040 | −13.700 |
| AC | −2.240 | 0.140 | −13.800 |
| AU | −1.100 | 0.140 | −15.400 |
| CG | −2.360 | 0.350 | −15.600 |
| CA | −2.110 | 0.210 | −14.400 |
| CC | −3.260 | 0.490 | −11.100 |
| CU | −2.080 | 0.520 | −14.000 |
| UG | −2.110 | 0.340 | −14.400 |
| UA | −1.330 | 0.210 | −16.000 |
| UC | −2.350 | 0.480 | −14.200 |
| UU | −0.930 | 0.440 | −13.700 |
For more details about the pseudo dinucleotide composition (pseDNC) feature, refer to [38]
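The composition features named in this record can be illustrated with a short sketch. It assumes that NC and DC are plain frequencies of the 4 nucleotides and the 16 dinucleotides in a sequence, and it uses the property values from the table above as a lookup. The function names are hypothetical, and the sketch does not implement the full pseDNC formula, which is defined in [38].

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGU"
# All 16 dinucleotides in a fixed order: AA, AC, AG, AU, CA, ...
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]

# (free energy, hydrophilicity, stacking energy), copied from the table above.
DINUC_PROPERTIES = {
    "GG": (-3.260, 0.170, -11.100), "GA": (-2.350, 0.100, -14.200),
    "GC": (-3.420, 0.260, -16.900), "GU": (-2.240, 0.270, -13.800),
    "AG": (-2.080, 0.080, -14.000), "AA": (-0.930, 0.040, -13.700),
    "AC": (-2.240, 0.140, -13.800), "AU": (-1.100, 0.140, -15.400),
    "CG": (-2.360, 0.350, -15.600), "CA": (-2.110, 0.210, -14.400),
    "CC": (-3.260, 0.490, -11.100), "CU": (-2.080, 0.520, -14.000),
    "UG": (-2.110, 0.340, -14.400), "UA": (-1.330, 0.210, -16.000),
    "UC": (-2.350, 0.480, -14.200), "UU": (-0.930, 0.440, -13.700),
}


def nucleotide_composition(seq):
    """Assumed NC definition: frequency of A, C, G, U (4 values)."""
    counts = Counter(seq)
    return [counts[n] / len(seq) for n in NUCLEOTIDES]


def dinucleotide_composition(seq):
    """Assumed DC definition: frequency of each overlapping dinucleotide (16 values)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return [counts[d] / len(pairs) for d in DINUCLEOTIDES]


def property_profile(seq, index):
    """Property value of each overlapping dinucleotide.

    index: 0 = free energy, 1 = hydrophilicity, 2 = stacking energy.
    """
    return [DINUC_PROPERTIES[seq[i:i + 2]][index] for i in range(len(seq) - 1)]
```

For example, `property_profile("GGAU", 0)` looks up the free energy of GG, GA, and AU in turn.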
Fig. 1 Flow charts of the jackknife cross validation for features encoded by PSNP or PSDP
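A minimal sketch of jackknife (leave-one-out) validation with a position-specific feature may help make the flow concrete. It rests on several assumptions not stated in this record: PSNP is taken to be the difference between per-position nucleotide frequencies in positive and negative samples, the propensity matrix is rebuilt without the held-out sequence in every round, and a trivial sum-of-scores rule stands in for the actual classifier. All function names are hypothetical.

```python
NUCLEOTIDES = "ACGU"


def position_frequencies(seqs):
    """freq[j][n] = fraction of sequences that have nucleotide n at position j."""
    length = len(seqs[0])
    counts = [{n: 0 for n in NUCLEOTIDES} for _ in range(length)]
    for s in seqs:
        for j, n in enumerate(s):
            counts[j][n] += 1
    return [{n: c[n] / len(seqs) for n in NUCLEOTIDES} for c in counts]


def psnp_matrix(positives, negatives):
    """Assumed PSNP definition: positive minus negative frequency per position."""
    f_pos = position_frequencies(positives)
    f_neg = position_frequencies(negatives)
    return [{n: f_pos[j][n] - f_neg[j][n] for n in NUCLEOTIDES}
            for j in range(len(f_pos))]


def encode(seq, matrix):
    """Replace each nucleotide by its propensity at that position."""
    return [matrix[j][n] for j, n in enumerate(seq)]


def jackknife_predictions(seqs, labels):
    """Leave one sequence out, rebuild the matrix from the rest, predict it."""
    predictions = []
    for i in range(len(seqs)):
        train = [(s, y) for k, (s, y) in enumerate(zip(seqs, labels)) if k != i]
        positives = [s for s, y in train if y == 1]
        negatives = [s for s, y in train if y == 0]
        matrix = psnp_matrix(positives, negatives)
        # Stand-in classifier (an assumption): predict positive if the summed
        # propensity of the held-out sequence is above zero.
        score = sum(encode(seqs[i], matrix))
        predictions.append(1 if score > 0 else 0)
    return predictions
```

Rebuilding the matrix inside the loop keeps the held-out sequence from leaking into its own features, which is the point of tying the encoding to the validation procedure.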
The results of feature selection for H_990
| Feature subset | Sen (%) | Spe (%) | Acc (%) | MCC | Kernel scale | Box constraint |
|---|---|---|---|---|---|---|
| NC | 62.83 | 51.31 | 57.07 | 0.1424 | 0.5 | 4 |
| DC | 46.87 | 74.95 | 60.91 | 0.2273 | 2 | 256 |
| pseDNC | 44.24 | 76.57 | 60.40 | 0.2199 | 4 | 1024 |
| PSNP | 66.06 | 60.61 | 63.33 | 0.2671 | 8 | 512 |
| PSDP | 55.15 | 57.17 | 56.16 | 0.1233 | 0.5 | 1024 |
| PSNP+NC | 65.05 | 61.21 | 63.13 | 0.2628 | 1 | 4 |
| PSNP+pseDNC | 64.44 | 62.42 | 63.43 | 0.2687 | 1 | 8 |
| PSNP+PSDP | 66.26 | 59.39 | 62.83 | 0.2572 | 8 | 1024 |
| PSNP+DC+NC | 64.85 | 63.43 | 64.14 | 0.2829 | 8 | 128 |
| PSNP+DC+pseDNC | 63.03 | 63.23 | 63.13 | 0.2626 | 4 | 32 |
| PSNP+DC+PSDP | 64.24 | 63.43 | 63.84 | 0.2768 | 1 | 2 |
The feature combination with the maximum MCC was italicized in the table
The results of feature selection for S_628
| Feature subset | Sen (%) | Spe (%) | Acc (%) | MCC | Kernel scale | Box constraint |
|---|---|---|---|---|---|---|
| NC | 71.97 | 45.22 | 58.60 | 0.1785 | 1 | 8 |
| DC | 64.33 | 59.87 | 62.10 | 0.2423 | 0.25 | 1 |
| pseDNC | 58.92 | 62.42 | 60.67 | 0.2135 | 0.25 | 0.5 |
| PSNP | 50.96 | 72.93 | 61.94 | 0.2448 | 1 | 0.125 |
| PSDP | 49.36 | 73.57 | 61.46 | 0.2363 | 0.25 | 0.03125 |
| DC+NC | 59.55 | 61.78 | 60.67 | 0.2134 | 4 | 512 |
| DC+pseDNC | 62.42 | 60.51 | 61.46 | 0.2293 | 1 | 1024 |
| DC+PSNP | 63.69 | 65.29 | 64.49 | 0.2898 | 0.5 | 16 |
| DC+PSDP | 60.51 | 66.88 | 63.69 | 0.2744 | 0.125 | 2 |
| DC+PSNP+NC | 61.78 | 65.61 | 63.69 | 0.2741 | 0.25 | 1 |
| DC+PSNP+PSDP | 63.38 | 67.20 | 65.29 | 0.3060 | 0.25 | 2 |
| DC+PSNP+pseDNC+NC | 61.78 | 65.92 | 63.85 | 0.2773 | 0.25 | 2 |
| DC+PSNP+pseDNC+PSDP | 62.74 | 67.52 | 65.13 | 0.3029 | 0.25 | 4 |
The feature combination with the maximum MCC was italicized in the table
The results of feature selection for M_944
| Feature subset | Sen (%) | Spe (%) | Acc (%) | MCC | Kernel scale | Box constraint |
|---|---|---|---|---|---|---|
| NC | 56.99 | 53.18 | 55.08 | 0.2233 | 2 | 2 |
| DC | 61.86 | 52.75 | 57.31 | 0.1468 | 4 | 1024 |
| pseDNC | 72.46 | 44.28 | 58.37 | 0.1744 | 4 | 128 |
| PSNP | 73.31 | 66.31 | 69.81 | 0.3972 | 0.5 | 1 |
| PSDP | 68.22 | 60.38 | 64.30 | 0.2869 | 1 | 256 |
| PSNP+NC | 69.70 | 70.34 | 70.02 | 0.4004 | 0.25 | 0.125 |
| PSNP+pseDNC | 74.15 | 66.53 | 70.34 | 0.4080 | 0.5 | 1 |
| PSNP+PSDP | 68.64 | 70.97 | 69.81 | 0.3963 | 0.125 | 0.5 |
| PSNP+DC+NC | 74.15 | 66.10 | 70.13 | 0.4039 | 0.5 | 0.25 |
| PSNP+DC+pseDNC | 73.09 | 67.80 | 70.44 | 0.4095 | 0.5 | 0.5 |
| PSNP+DC+PSDP | 74.58 | 66.31 | 70.44 | 0.4103 | 0.5 | 0.25 |
The feature combination with the maximum MCC was italicized in the table
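The layout of the three feature selection tables (single features first, then pairs built on one feature, then larger combinations) suggests a greedy forward search. The sketch below shows that generic procedure only. The selection criterion used at each step is not stated in this record, so the scoring callback `evaluate` is a hypothetical placeholder left to the caller; in practice it might run a cross validation and return MCC or accuracy.

```python
def forward_selection(candidates, evaluate):
    """Greedy forward search over feature types.

    candidates: list of feature names, e.g. ["NC", "DC", "pseDNC", "PSNP", "PSDP"].
    evaluate:   hypothetical callback that takes a list of feature names and
                returns a score to maximize.
    Returns (selected features, best score, per-round history).
    """
    selected = []
    best_score = float("-inf")
    history = []
    remaining = list(candidates)
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        round_score, round_feature = max(scored)
        history.append((selected + [round_feature], round_score))
        if round_score <= best_score:
            break  # no extension improves on the current subset
        selected.append(round_feature)
        remaining.remove(round_feature)
        best_score = round_score
    return selected, best_score, history
```

The search stops at the first round in which no added feature improves the score, so the number of evaluated subsets grows roughly quadratically with the number of feature types rather than exponentially.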
Fig. 2 The ROC curves that show the performances of the five types of features for H. sapiens, S. cerevisiae, and M. musculus, respectively
A comparison of PseUI with iRNA-PseU and re-iRNA-PseU on three training datasets
| Training dataset | Predictor | Sen (%) | Spe (%) | Acc (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| H_990 | iRNA-PseU^a | 61.01 | 59.80 | 60.40 | 0.21 | 0.64 |
| | re-iRNA-PseU^b | 65.05 | 58.79 | 61.92 | 0.24 | 0.65 |
| | PseUI^c | 64.85 | 63.64 | 64.24 | 0.28 | 0.68 |
| S_628 | iRNA-PseU^a | 64.65 | 64.33 | 64.49 | 0.29 | 0.81 |
| | re-iRNA-PseU^b | 66.88 | 64.33 | 65.61 | 0.31 | 0.69 |
| | PseUI^c | 62.10 | 71.02 | 66.56 | 0.33 | 0.69 |
| M_944 | iRNA-PseU^a | 73.31 | 64.83 | 69.07 | 0.38 | 0.75 |
| | re-iRNA-PseU^b | 79.87 | 60.81 | 70.34 | 0.41 | 0.75 |
| | PseUI^c | 74.58 | 66.31 | 70.44 | 0.41 | 0.77 |
^a The predictor developed by Chen et al. [14]
^b The predictor we re-implemented following the method proposed by Chen et al. [14]
^c The predictor proposed in this paper
Fig. 3 The ROC curves of the best models for H. sapiens, S. cerevisiae, and M. musculus, respectively
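For readers who want to relate the ROC curves to the AUC column, the area under a ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes AUC that way. It is a generic illustration, not the code used to produce the figures, and the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: classifier outputs, higher meaning more likely positive.
    labels: 1 for positive samples, 0 for negative samples.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for datasets of the sizes listed here (at most 495 positives and 495 negatives).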
A comparison of PseUI with the re-iRNA-PseU on two independent datasets
| Dataset | Predictor | Sen (%) | Spe (%) | Acc (%) | MCC |
|---|---|---|---|---|---|
| H_200 | re-iRNA-PseU^a | 58.00 | 65.00 | 61.50 | 0.23 |
| | PseUI^b | 63.00 | 68.00 | 65.50 | 0.31 |
| S_200 | re-iRNA-PseU^a | 63.00 | 57.00 | 60.00 | 0.20 |
| | PseUI^b | 72.00 | 65.00 | 68.50 | 0.37 |
^a The predictor we re-implemented following the method proposed by Chen et al. [14]
^b The predictor proposed in this paper