Yanjuan Li, Zhengnan Zhao, Zhixia Teng.
Abstract
As one of the important epigenetic modifications, DNA N4-methylcytosine (4mC) plays a crucial role in controlling DNA replication, gene expression, the cell cycle, and differentiation. The accurate identification of 4mC sites is necessary to understand their biological functions. In this paper, we use ensemble learning to develop a model named i4mC-EL to identify 4mC sites in the mouse genome. First, a multifeature encoding scheme consisting of Kmer and EIIP was adopted to describe the DNA sequences. Second, on the basis of this encoding scheme, we developed a stacked ensemble model in which four machine learning algorithms, namely BayesNet, NaiveBayes, LibSVM, and Voted Perceptron, serve as base classifiers whose intermediate results are the input of the metaclassifier, Logistic. The experimental results on the independent test dataset demonstrate that the overall predictive accuracy of i4mC-EL is 82.19%, which is better than that of the existing methods. The user-friendly website implementing i4mC-EL can be accessed freely at the following.
Year: 2021 PMID: 34159192 PMCID: PMC8187051 DOI: 10.1155/2021/5515342
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. The framework of i4mC-EL.
Electron-ion interaction pseudopotential (EIIP) values for the DNA nucleotides (NT).
| NT | A | C | G | T |
|---|---|---|---|---|
| EIIP | 0.1260 | 0.1340 | 0.0806 | 0.1335 |
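
The abstract names Kmer and EIIP as the two descriptors of the multifeature encoding. The sketch below shows one plausible way to compute and concatenate them. The EIIP values come from the table above; the function names, the choice of `k=2`, the use of relative k-mer frequencies, and the per-position EIIP substitution are assumptions made for illustration and may differ from the authors' exact settings.

```python
from itertools import product

# EIIP values copied from the table above.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_encode(seq):
    """Replace each nucleotide with its EIIP value (one number per position)."""
    return [EIIP[nt] for nt in seq.upper()]


def kmer_encode(seq, k=2):
    """Return the relative frequency of every possible k-mer, in lexicographic order."""
    seq = seq.upper()
    windows = len(seq) - k + 1
    if windows < 1:
        raise ValueError("sequence is shorter than k")
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(windows):
        counts[seq[i:i + k]] += 1  # raises KeyError on non-ACGT symbols such as N
    return [c / windows for c in counts.values()]


def encode(seq, k=2):
    """Concatenate the k-mer and EIIP descriptors into one feature vector (assumed layout)."""
    return kmer_encode(seq, k) + eiip_encode(seq)


features = encode("ACGTAC", k=2)
print(len(features))  # 16 k-mer frequencies followed by 6 positional EIIP values
```

With fixed-length input sequences, every sample yields a vector of the same length, which is what a downstream classifier needs.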
Figure 2. Working diagram of ensemble learning.
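
The stacked ensemble described in the abstract feeds the outputs of several base classifiers into a Logistic metaclassifier. The sketch below illustrates only that data flow in plain Python. It is not the authors' implementation: the two base learners (`Centroid`, `Perceptron`) are toy stand-ins for BayesNet, NaiveBayes, LibSVM, and Voted Perceptron, the synthetic data is invented, and training the metaclassifier on out-of-fold scores with `k=5` is an assumption about how the intermediate results are produced.

```python
import math
import random


class Centroid:
    """Toy base learner: score from the distances to the two class centroids."""

    def fit(self, X, y):
        self.c = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.c[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, x):
        d0, d1 = math.dist(x, self.c[0]), math.dist(x, self.c[1])
        return d0 / (d0 + d1) if d0 + d1 > 0 else 0.5


class Perceptron:
    """Toy base learner: plain perceptron that outputs a hard 0/1 score."""

    def __init__(self, epochs=20):
        self.epochs = epochs

    def fit(self, X, y):
        self.w, self.b = [0.0] * len(X[0]), 0.0
        for _ in range(self.epochs):
            for x, t in zip(X, y):
                err = t - self.predict_proba(x)  # zero when the sample is already correct
                self.w = [w + err * xi for w, xi in zip(self.w, x)]
                self.b += err
        return self

    def predict_proba(self, x):
        return 1.0 if self.b + sum(w * xi for w, xi in zip(self.w, x)) > 0 else 0.0


class Logistic:
    """Metaclassifier: logistic regression trained by stochastic gradient descent."""

    def __init__(self, lr=0.5, epochs=300):
        self.lr, self.epochs = lr, epochs

    def fit(self, X, y):
        self.w, self.b = [0.0] * len(X[0]), 0.0
        for _ in range(self.epochs):
            for x, t in zip(X, y):
                g = self.predict_proba(x) - t  # gradient of the log loss w.r.t. the logit
                self.w = [w - self.lr * g * xi for w, xi in zip(self.w, x)]
                self.b -= self.lr * g
        return self

    def predict_proba(self, x):
        z = self.b + sum(w * xi for w, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))  # clamp to avoid overflow


def out_of_fold(factory, X, y, k):
    """Score every sample with a model that did not see it during training."""
    scores = [0.0] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        model = factory().fit([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(X), k):
            scores[i] = model.predict_proba(X[i])
    return scores


class Stacked:
    """Base learners produce intermediate scores; the metaclassifier combines them."""

    def __init__(self, factories, k=5):
        self.factories, self.k = factories, k

    def fit(self, X, y):
        columns = [out_of_fold(f, X, y, self.k) for f in self.factories]
        self.meta = Logistic().fit([list(row) for row in zip(*columns)], y)
        self.bases = [f().fit(X, y) for f in self.factories]  # refit on all data
        return self

    def predict(self, x):
        z = [base.predict_proba(x) for base in self.bases]
        return int(self.meta.predict_proba(z) >= 0.5)


# Synthetic, well-separated two-class data (illustrative only).
random.seed(0)
y = [i % 2 for i in range(40)]
X = [[random.gauss(2.0 * t, 0.5), random.gauss(-2.0 * t, 0.5)] for t in y]
model = Stacked([Centroid, Perceptron]).fit(X, y)
accuracy = sum(model.predict(x) == t for x, t in zip(X, y)) / len(y)
print(accuracy)
```

Training the metaclassifier on out-of-fold scores keeps it from simply learning how well the base learners memorized their own training samples.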
Performance comparison of different feature encoding schemes under 10-fold cross-validation (ACC: accuracy; MCC: Matthews correlation coefficient; Sn: sensitivity; Sp: specificity).
| Schemes | ACC | MCC | Sn | Sp |
|---|---|---|---|---|
| BPF | 0.668 | 0.335 | 0.665 | 0.670 |
| DPE | 0.614 | 0.228 | 0.619 | 0.609 |
| RFHC | 0.658 | 0.316 | 0.669 | 0.647 |
| RevKmer | 0.755 | 0.511 | 0.745 | 0.765 |
| PseKNC | 0.794 | 0.589 | 0.786 | 0.803 |
|  | 0.724 | 0.448 | 0.729 | 0.718 |
|  | 0.747 | 0.493 | 0.744 | 0.749 |
| RevKmer+DBE | 0.738 | 0.476 | 0.723 | 0.753 |
| RevKmer+EIIP | 0.779 | 0.558 | 0.764 | 0.794 |
|  | 0.732 | 0.464 | 0.741 | 0.723 |
| Our method | 0.803 | 0.606 | 0.784 | 0.822 |
Figure 3. ROC curves for different feature encoding schemes under 10-fold cross-validation.
Performance comparison of different classifiers under 10-fold cross-validation.
| Classifiers | ACC | MCC | Sn | Sp |
|---|---|---|---|---|
| BayesNet | 0.727 | 0.453 | 0.739 | 0.714 |
| NaiveBayes | 0.752 | 0.504 | 0.751 | 0.753 |
| SGD | 0.712 | 0.424 | 0.710 | 0.713 |
| SimpleLogistic | 0.761 | 0.522 | 0.753 | 0.768 |
| SMO | 0.702 | 0.405 | 0.706 | 0.698 |
| IBk | 0.637 | 0.276 | 0.584 | 0.690 |
| JRip | 0.707 | 0.414 | 0.692 | 0.723 |
| J48 | 0.665 | 0.330 | 0.674 | 0.655 |
| RandomForest | 0.770 | 0.541 | 0.753 | 0.787 |
| AdaBoostM1 | 0.713 | 0.427 | 0.739 | 0.688 |
| Bagging | 0.729 | 0.459 | 0.744 | 0.714 |
| Our method | 0.803 | 0.606 | 0.784 | 0.822 |
Figure 4. ROC curves for different classifiers under 10-fold cross-validation.
Performance comparison of different feature encoding schemes on the independent test set TEST-320.
| Schemes | ACC | MCC | Sn | Sp |
|---|---|---|---|---|
| BPF | 0.753 | 0.530 | 0.606 | 0.900 |
| DPE | 0.697 | 0.401 | 0.600 | 0.794 |
| RFHC | 0.716 | 0.438 | 0.631 | 0.800 |
| RevKmer | 0.666 | 0.335 | 0.744 | 0.588 |
| PseKNC | 0.781 | 0.563 | 0.788 | 0.775 |
|  | 0.772 | 0.553 | 0.681 | 0.863 |
|  | 0.800 | 0.614 | 0.694 | 0.906 |
| RevKmer+DBE | 0.756 | 0.516 | 0.700 | 0.813 |
| RevKmer+EIIP | 0.713 | 0.427 | 0.763 | 0.663 |
|  | 0.772 | 0.553 | 0.681 | 0.863 |
| Our method | 0.822 | 0.644 | 0.806 | 0.838 |
Figure 5. ROC curves for different feature encoding schemes on TEST-320.
Performance comparison of different classifiers on TEST-320.
| Classifiers | ACC | MCC | Sn | Sp |
|---|---|---|---|---|
| BayesNet | 0.769 | 0.547 | 0.675 | 0.863 |
| NaiveBayes | 0.788 | 0.577 | 0.744 | 0.831 |
| SGD | 0.688 | 0.379 | 0.756 | 0.619 |
| SimpleLogistic | 0.728 | 0.456 | 0.738 | 0.719 |
| SMO | 0.675 | 0.353 | 0.744 | 0.606 |
| IBk | 0.600 | 0.201 | 0.563 | 0.638 |
| JRip | 0.769 | 0.541 | 0.713 | 0.825 |
| J48 | 0.663 | 0.325 | 0.656 | 0.669 |
| RandomForest | 0.778 | 0.558 | 0.738 | 0.819 |
| AdaBoostM1 | 0.791 | 0.581 | 0.794 | 0.788 |
| Bagging | 0.781 | 0.564 | 0.744 | 0.819 |
| Our method | 0.822 | 0.644 | 0.806 | 0.838 |
Figure 6. ROC curves for different classifiers on TEST-320.
Performance comparison of i4mC-EL with existing models on TEST-320.
| Models | ACC | MCC | Sn | Sp |
|---|---|---|---|---|
| 4mcPred-EL | 0.791 | 0.584 | 0.757 | 0.825 |
| i4mC-Mouse | 0.816 | 0.633 | 0.807 | 0.825 |
| i4mC-EL | 0.822 | 0.644 | 0.806 | 0.838 |
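
The four columns reported in the tables (ACC, MCC, Sn, Sp) can all be derived from a confusion matrix using their standard definitions. The sketch below computes them and checks the result against the i4mC-EL row. The confusion-matrix counts are not reported in the source: they are back-calculated under the assumption that TEST-320 contains 160 positive and 160 negative samples, which is consistent with the rounded values in the table but remains an assumption.

```python
import math


def metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity: recall on positive samples
    sp = tn / (tn + fp)                    # specificity: recall on negative samples
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews correlation coefficient
    return {"ACC": acc, "MCC": mcc, "Sn": sn, "Sp": sp}


# Hypothetical counts, assuming 160 positives and 160 negatives in TEST-320.
result = metrics(tp=129, fn=31, tn=134, fp=26)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Under that assumption, the accuracy works out to 263/320, which matches the 82.19% quoted in the abstract.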