Huan Zhu, Chun-Yan Ao, Yi-Jie Ding, Hong-Xia Hao, Liang Yu.
Abstract
Dihydrouridine (D) is an abundant post-transcriptional modification present in transfer RNA from eukaryotes, bacteria, and archaea. D has contributed to treatments for cancerous diseases, so the precise detection of D modification sites can enable further understanding of its functional roles. Traditional experimental techniques for identifying D are laborious and time-consuming, and few computational tools exist for this task. In this study, we used eleven sequence-derived feature extraction methods and implemented five popular machine learning algorithms to identify an optimal model. During data preprocessing, the data were partitioned into training and testing sets, and oversampling was adopted to reduce the effect of the imbalance between positive and negative samples. The best-performing model combined a random forest classifier with nucleotide chemical property (NCP) encoding. In independent tests, the optimized model achieved a sensitivity of 0.9688 and a specificity of 0.9706, and it surpassed published tools. Furthermore, a series of validations across several aspects was conducted to demonstrate the robustness and reliability of the model.
Keywords: dihydrouridine; nucleotide chemical properties; oversample; prediction; random forest
Year: 2022 PMID: 35328461 PMCID: PMC8950657 DOI: 10.3390/ijms23063044
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. (A) Generation of the training and testing data partition and oversampling. (B) Selection of three features to encode the sequence. (C) Input of feature vectors into classifiers and identification of the best combination of feature and classifier. (D) Performance evaluation with a set of metrics.
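The oversampling step in panel (A) can be illustrated with a minimal sketch. The source does not say which oversampling method was used, so this assumes plain random oversampling with replacement; the function name `random_oversample` and the toy sequences are hypothetical and not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes
    have as many samples as the largest class (assumed method)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy training set: 2 positives and 5 negatives (made-up sequences).
train_x = ["GGUAC", "AUUCG", "CCGAU", "UAGCA", "GAUUC", "ACGUA", "UUGCC"]
train_y = [1, 1, 0, 0, 0, 0, 0]
bal_x, bal_y = random_oversample(train_x, train_y)
print(Counter(bal_y))
```

In a setup like this, oversampling would normally be applied to the training partition only, so that duplicated samples cannot leak into the test set.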
Performance of different sequence encoding schemes using SVM and RF in independent tests.
| Encoding Scheme | SVM Sn | SVM Sp | SVM Acc | SVM MCC | RF Sn | RF Sp | RF Acc | RF MCC |
|---|---|---|---|---|---|---|---|---|
| Kmer | 0.2955 | 0.8772 | 0.7152 | 0.2056 | 0.5161 | 0.6176 | 0.5859 | 0.1255 |
| RCKmer | 0.1364 | 0.9298 | 0.7089 | 0.1044 | 0.5263 | 0.8056 | 0.7091 | 0.3415 |
| NAC | 0.1136 | | 0.7468 | 0.2459 | 0.4063 | 0.7910 | 0.6667 | 0.2072 |
| DNC | 0.2955 | 0.8772 | 0.7152 | 0.2056 | 0.5806 | 0.7794 | 0.7172 | 0.3542 |
| TNC | 0.5682 | 0.8772 | 0.7911 | 0.4630 | 0.7368 | 0.9444 | 0.8727 | 0.7133 |
| ANF | 0.4773 | 0.8947 | 0.7785 | 0.4102 | 0.6316 | 0.8889 | 0.8000 | 0.5449 |
| ENAC | | 0.9737 | | | 0.8947 | 0.9722 | 0.9455 | 0.8786 |
| BINARY | 0.8636 | 0.9474 | 0.9241 | 0.8110 | 0.8065 | | 0.9394 | 0.8609 |
| NCP | 0.8636 | 0.9474 | 0.9241 | 0.8110 | | 0.9851 | | |
| EIIP | 0.6818 | 0.8860 | 0.8291 | 0.5718 | 0.8125 | | 0.9394 | 0.8636 |
| PseEIIP | 0.5682 | 0.6754 | 0.6456 | 0.2236 | 0.7368 | 0.9444 | 0.8727 | 0.7133 |
Performance of ENAC, BINARY, and NCP with different testing data partition rates by SVM.
| Encoding Scheme | Testing Data Partition Rate | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|
| ENAC | 30% | 0.2623 | 0.9100 | 0.6646 | 0.2308 |
| ENAC | 20% | 0.6111 | 0.8136 | 0.7368 | 0.4327 |
| ENAC | | | | | |
| BINARY | 30% | 0.5738 | 0.9600 | 0.8137 | 0.6044 |
| BINARY | 20% | 0.5556 | 0.9322 | 0.7895 | 0.5446 |
| BINARY | | | | | |
| NCP | 30% | 0.6957 | | 0.8974 | 0.7482 |
| NCP | 20% | 0.7857 | 0.9524 | 0.9011 | 0.7632 |
| NCP | | | 0.9722 | | |
Performance of different classifiers with ENAC, BINARY, and NCP.
| Algorithm | Encoding Scheme | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|
| RF | ENAC | 0.9375 | 0.9706 | 0.9545 | 0.9093 |
| RF | BINARY | 0.9531 | 0.9559 | 0.9545 | 0.9090 |
| RF | NCP | | | | |
| SVM | ENAC | 0.9063 | 0.8235 | 0.8333 | 0.6670 |
| SVM | BINARY | 0.8438 | | 0.8939 | 0.7882 |
| SVM | NCP | | 0.8529 | | |
| KNN | ENAC | | 0.8235 | 0.8939 | 0.7978 |
| KNN | BINARY | 0.9531 | 0.7059 | 0.8258 | 0.6764 |
| KNN | NCP | | | | |
| LR | ENAC | 0.8594 | 0.8235 | 0.8409 | 0.6827 |
| LR | BINARY | | 0.8382 | 0.8712 | |
| LR | NCP | 0.7500 | | | 0.7406 |
| MLP | ENAC | 0.9219 | 0.7941 | 0.8561 | 0.7197 |
| MLP | BINARY | 0.9219 | | 0.9091 | 0.8186 |
| MLP | NCP | | | | |
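The four metrics reported in these tables (Sn, Sp, Acc, MCC) have standard definitions in terms of confusion-matrix counts, sketched below. The record does not give the test-set sizes or the confusion matrices, so the counts in the example are hypothetical, chosen only because they are consistent with the RF/ENAC row; `binary_metrics` is an illustrative name.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard sensitivity, specificity, accuracy, and Matthews
    correlation coefficient from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity (recall on positives)
    sp = tn / (tn + fp)                    # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical counts (assumption): 64 positives and 68 negatives.
m = binary_metrics(tp=60, fn=4, tn=66, fp=2)
print({k: round(v, 4) for k, v in m.items()})
```

With these assumed counts, the rounded values agree with the RF/ENAC row (0.9375, 0.9706, 0.9545, 0.9093). This shows only that such counts are consistent with that row, not that they are the actual ones.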
Figure 2. (A) ROC curve under 5-fold CV. (B) ROC curve under the independent test.
Figure 3. (A) ROC curve of the 5-fold CV of experiment I. (B) ROC curve of the independent tests of experiment I.
Figure 4. Radar map showing the performance of experiment II. The species at each corner served as the testing data, while the remaining data were used for training. Differently colored radar maps indicate different performance metrics.
Figure 5. Heat map showing the cross-species prediction accuracies (Acc values). Samples from the species in each row were used for training, and samples from the species in each column were used for testing.
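The two cross-species protocols described in the captions of Figures 4 and 5 can be sketched as data splits. This is a minimal illustration, not the authors' code: the record layout, the placeholder species labels, and the decision to skip same-species pairs in the pairwise split are all assumptions.

```python
def leave_one_species_out(records):
    """Figure 4 style: hold out one species for testing and train on
    all remaining species, once per species."""
    species = sorted({r["species"] for r in records})
    for held_out in species:
        train = [r for r in records if r["species"] != held_out]
        test = [r for r in records if r["species"] == held_out]
        yield held_out, train, test


def pairwise_species(records):
    """Figure 5 style: train on one species and test on another.
    Same-species pairs are skipped here (an assumption)."""
    species = sorted({r["species"] for r in records})
    for train_sp in species:
        for test_sp in species:
            if train_sp == test_sp:
                continue
            train = [r for r in records if r["species"] == train_sp]
            test = [r for r in records if r["species"] == test_sp]
            yield train_sp, test_sp, train, test


# Made-up records with placeholder species labels.
records = [
    {"species": "species_1", "seq": "GGUAC", "label": 1},
    {"species": "species_1", "seq": "CCGAU", "label": 0},
    {"species": "species_2", "seq": "AUUCG", "label": 1},
    {"species": "species_2", "seq": "UAGCA", "label": 0},
    {"species": "species_3", "seq": "GAUUC", "label": 0},
]
for held_out, train, test in leave_one_species_out(records):
    print(held_out, len(train), len(test))
```

Each split would then feed the same encoding and classifier, with one metric value per held-out species (radar map) or per species pair (heat map).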
Comparisons between iRNAD, iRNAD_XGBoost, and our current model to identify D modification sites in independent tests.
| Tools | Sn (%) | Sp (%) | Acc (%) | MCC | AUC | Pre (%) | F1 |
|---|---|---|---|---|---|---|---|
| | 86.11 | 96.05 | 92.86 | 0.83 | 0.98 | N/A | N/A |
| | 91.67 | 94.74 | 93.75 | 0.86 | 0.87 | 89.19 | |
| | | | | | | | 0.85 |
The distribution of D in five species.
| Species | | | | | |
|---|---|---|---|---|---|
| | 29 | 13 | 9 | 91 | 34 |
| | 68 | 48 | 38 | 93 | 127 |
Chemical properties of each nucleotide [36].
| Chemical Properties | Classes | Nucleotides |
|---|---|---|
| Ring Structure | Pyrimidine | U, C |
| Ring Structure | Purine | G, A |
| Functional Group | Keto | U, G |
| Functional Group | Amino | C, A |
| Hydrogen Bond | Weak | U, A |
| Hydrogen Bond | Strong | G, C |
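The table above maps directly onto a three-bit code per nucleotide, which is how NCP encoding is commonly implemented. The sketch below assumes one particular bit convention (purine = 1, amino = 1, weak hydrogen bond = 1); the table fixes the class memberships but not which class is coded as 1, so that choice, along with the name `ncp_encode` and the T-to-U normalization, is an assumption.

```python
# One (ring, functional group, hydrogen bond) triple per nucleotide.
# Assumed convention: purine = 1, amino = 1, weak hydrogen bond = 1.
NCP = {
    "A": (1, 1, 1),  # purine, amino, weak
    "C": (0, 1, 0),  # pyrimidine, amino, strong
    "G": (1, 0, 0),  # purine, keto, strong
    "U": (0, 0, 1),  # pyrimidine, keto, weak
}


def ncp_encode(seq):
    """Encode an RNA sequence as a flat list of 3 * len(seq) bits."""
    vec = []
    for base in seq.upper().replace("T", "U"):
        if base not in NCP:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        vec.extend(NCP[base])
    return vec


print(ncp_encode("GAUC"))
```

A sequence window of length n therefore becomes a vector of length 3n, which can be passed to a classifier such as a random forest.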