Ze Liu, Wei Dong, Wei Jiang, Zili He.
Abstract
DNA N6-methyldeoxyadenosine (6mA) modifications were first found more than 60 years ago but were thought to be widespread only in prokaryotes and unicellular eukaryotes. With the development of high-throughput sequencing technology, 6mA modifications have been found experimentally in a range of multicellular eukaryotes. However, experimental methods are time-consuming and costly, which makes it necessary to develop computational methods instead. In this study, a machine learning-based prediction tool named csDMA was developed for predicting 6mA modifications. First, three feature encoding schemes (Motif, Kmer, and Binary) were used to generate the feature matrix. Second, different algorithms were evaluated for the prediction model, and the ExtraTrees model achieved the best AUC of 0.878 under 5-fold cross-validation on the training dataset. The ExtraTrees model also achieved the best AUC of 0.893 on the independent testing dataset. Finally, we compared our method with state-of-the-art predictors, and the results showed that our model achieved better performance than existing tools.
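The Kmer and Binary schemes named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions used by csDMA (the value of k, the sequence window length, or how the Motif features are built), so the functions below are generic textbook versions of k-mer frequency and one-hot encoding, and the names `kmer_features` and `binary_features` are hypothetical. The Motif scheme is left out because it is not defined here.

```python
from itertools import product

ALPHABET = "ACGT"

def kmer_features(seq, k=2):
    """Return normalized k-mer frequencies in lexicographic k-mer order.

    Assumption: a plain sliding-window count; csDMA's exact Kmer
    definition is not given in the abstract.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]

def binary_features(seq):
    """One-hot encode each nucleotide as four 0/1 values (A, C, G, T).

    Symbols outside ACGT become four zeros.
    """
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == letter else 0 for letter in ALPHABET)
    return vec

example = "GATCA"
print(kmer_features(example, k=1))  # [0.4, 0.2, 0.2, 0.2]
print(binary_features("GA"))        # [0, 0, 1, 0, 1, 0, 0, 0]
```

Concatenating the two vectors for each sequence would give one row of a feature matrix; how csDMA actually combines its feature subsets is not stated in this record.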
Year: 2019 PMID: 31511570 PMCID: PMC6739324 DOI: 10.1038/s41598-019-49430-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Number of sequences retained in each dataset after reducing sequence redundancy with the CD-HIT-EST software.
| Species | Dataset | Identity threshold 0.95 | Identity threshold 0.90 | Identity threshold 0.85 | Identity threshold 0.80 |
|---|---|---|---|---|---|
| Mouse | Positive | 1,931 | 1,924 | 1,914 | 1,892 |
| Mouse | Negative | 1,885 | 1,866 | 1,844 | 1,836 |
| Rice | Positive | 880 | 879 | 878 | 876 |
| Rice | Negative | 880 | 880 | 880 | 880 |
| Cross-species | Positive | 2,811 | 2,803 | 2,792 | 2,768 |
| Cross-species | Negative | 2,767 | 2,746 | 2,724 | 2,716 |
Figure 1. The framework of csDMA.
Figure 2. Model performance based on the different feature subsets. The Random Forest classifier used 1,000 decision trees, and 5-fold cross-validation was used to evaluate the performance of csDMA.
Figure 3. The model performance of different classifiers. The Motif, Kmer, and Binary feature subsets were supplied to each classifier, and the optimized parameters were used for model training. Each classifier was evaluated by 5-fold cross-validation using standard measures such as ACC, Sn, and Sp.
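The 5-fold cross-validation mentioned in the captions above can be sketched as an index split. This is a minimal illustration only: whether the authors stratified the folds by class, and which random seed they used, is not stated in this record, so stratification, the `seed` parameter, and the function name `stratified_kfold` are assumptions.

```python
import random

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with each class spread evenly over k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

# Toy labels: 10 positive and 10 negative samples (hypothetical data).
labels = [1] * 10 + [0] * 10
for train, test in stratified_kfold(labels, k=5):
    print(len(train), len(test))  # 16 4 on every fold
```

Each sample appears in exactly one test fold, so averaging a metric over the five folds uses every training sequence once for evaluation.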
Model performance of each algorithm on the training dataset.
| Algorithm | Sn | Sp | ACC | MCC | AUC | F1 |
|---|---|---|---|---|---|---|
| RandomForest | 0.853 | 0.735 | 0.794 | 0.593 | 0.871 | 0.806 |
| GradientBoosting | 0.743 | 0.762 | 0.752 | 0.506 | 0.818 | 0.750 |
| AdaBoost | 0.713 | 0.718 | 0.715 | 0.431 | 0.777 | 0.715 |
| ExtraTrees | | 0.735 | | | **0.878** | |
| SVM | 0.807 | | 0.785 | 0.572 | 0.858 | 0.790 |
The highest value of each column is marked in bold.
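The column headers in the table above (Sn, Sp, ACC, MCC, F1) follow from a binary confusion matrix. The sketch below uses the standard definitions of these measures; the record does not spell out the formulas the authors used, so treating them as the standard ones is an assumption, and the counts in the example are hypothetical rather than taken from the paper.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Compute Sn, Sp, ACC, MCC and F1 from confusion-matrix counts.

    Assumes every denominator except the MCC one is non-zero.
    """
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)  # accuracy
    precision = tp / (tp + fp)
    f1 = 2 * precision * sn / (precision + sn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc, "F1": f1}

# Hypothetical counts: 100 positives and 100 negatives.
metrics = classification_metrics(tp=85, fn=15, tn=75, fp=25)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

AUC is not included because it depends on the ranking of prediction scores rather than on a single confusion matrix.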
Model performance of the different algorithms on the independent testing dataset.
| Algorithm | Sn | Sp | ACC | MCC | AUC | F1 |
|---|---|---|---|---|---|---|
| RandomForest | 0.875 | 0.747 | | | 0.884 | |
| GradientBoosting | 0.765 | 0.757 | 0.761 | 0.522 | 0.854 | 0.771 |
| AdaBoost | 0.776 | 0.719 | 0.749 | 0.496 | 0.814 | 0.764 |
| ExtraTrees | | 0.729 | 0.813 | 0.628 | **0.893** | |
| SVM | 0.843 | | 0.804 | 0.607 | 0.875 | 0.819 |
The highest value of each column is marked in bold.
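The AUC column can be read as the probability that a randomly chosen positive sequence receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition; the function name `roc_auc` and the toy scores are hypothetical, and the authors' own evaluation code is not shown in this record.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(P * N) time, which is
    adequate for datasets of a few thousand sequences.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.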
Figure 4. Performance comparison of csDMA and iDNA6mA-PseKNC. (A) The ROC curves of csDMA and iDNA6mA-PseKNC. (B) The Precision-Recall curves of csDMA and iDNA6mA-PseKNC.
Model performance of each algorithm across species.
| Algorithm | Species | Sn | Sp | ACC | MCC | AUC | F1 |
|---|---|---|---|---|---|---|---|
| csDMA | Cross-species | 0.863 | 0.735 | 0.799 | 0.603 | 0.879 | 0.811 |
| csDMA | Rice | 0.842 | 0.880 | 0.861 | 0.723 | 0.923 | 0.858 |
| csDMA | Mouse | 0.932 | 1 | 0.966 | 0.935 | 0.974 | 0.965 |
| iDNA6mA-PseKNC | Cross-species | 0.762 | 0.769 | 0.765 | 0.531 | 0.844 | 0.764 |
| iDNA6mA-PseKNC | Rice | 0.569 | 0.721 | 0.641 | 0.394 | 0.896 | 0.543 |
| iDNA6mA-PseKNC | Mouse | 0.869 | 1 | 0.935 | 0.877 | 0.974 | 0.930 |