Shaherin Basith, Balachandran Manavalan, Tae Hwan Shin, Gwang Lee.
Abstract
DNA N6-adenine methylation (6mA) is an epigenetic modification found in both prokaryotes and eukaryotes. Identifying 6mA sites in the rice genome is important for rice epigenetics and breeding, but the non-random distribution and biological functions of these sites remain unclear. Several machine-learning tools can identify 6mA sites, but their limited prediction accuracy restricts their usability in epigenetic research. Here, we developed a novel computational predictor, the Sequence-based DNA N6-methyladenine predictor (SDM6A), a two-layer ensemble approach for identifying 6mA sites in the rice genome. Unlike existing methods, which are based on single models with basic features, SDM6A explores various features, and five encoding methods were identified as appropriate for this problem. Subsequently, an optimal feature set was identified from the encodings, and the corresponding models were developed individually using support vector machine (SVM) and extremely randomized tree (ERT). First, all five single models were integrated via an ensemble approach to define the class for each classifier. Second, the two classifiers were integrated to generate a final prediction. SDM6A achieved robust performance on cross-validation and independent evaluation, with average accuracy and Matthews correlation coefficient (MCC) of 88.2% and 0.764, respectively. Corresponding metrics were 4.7%-11.0% and 2.3%-5.5% higher than those of existing methods, respectively. A user-friendly, publicly accessible web server (http://thegleelab.org/SDM6A) was implemented to predict novel putative 6mA sites in the rice genome.
Keywords: DNA N(6)-adenine methylation; extremely randomized tree; machine learning; rice genome; support vector machine
Year: 2019 PMID: 31542696 PMCID: PMC6796762 DOI: 10.1016/j.omtn.2019.08.011
Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886
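The two-layer ensemble described in the abstract can be illustrated with a minimal sketch. It assumes that each layer simply averages predicted probabilities and that a 0.5 cutoff separates the classes; the abstract does not state the combination rule, so both choices, along with the function names and the example probabilities, are hypothetical.

```python
from statistics import mean


def layer_one(encoding_probs):
    """Layer 1: combine the per-encoding models of one classifier.

    encoding_probs holds one predicted 6mA probability per feature encoding.
    Plain averaging is an assumption, not the published rule.
    """
    return mean(encoding_probs)


def two_layer_predict(svm_probs, ert_probs, threshold=0.5):
    """Layer 2: combine the SVM-level and ERT-level scores into one call.

    The averaging step and the 0.5 threshold are illustrative assumptions.
    """
    svm_score = layer_one(svm_probs)
    ert_score = layer_one(ert_probs)
    final_score = mean([svm_score, ert_score])
    label = "6mA" if final_score >= threshold else "non-6mA"
    return final_score, label


# Made-up probabilities for one candidate site, five encodings per classifier.
score, label = two_layer_predict(
    svm_probs=[0.9, 0.8, 0.7, 0.6, 0.5],
    ert_probs=[0.4, 0.5, 0.6, 0.7, 0.8],
)
print(round(score, 3), label)
```

Keeping the two layers as separate functions makes it easy to swap averaging for another rule, such as majority voting, without touching the rest of the sketch.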
Figure 1. Overall Framework of SDM6A
The four major steps are: (1) data collection and pre-processing, (2) feature extraction and optimization using a two-step feature selection protocol, (3) parameter optimization and construction of the ensemble model, and (4) performance assessment and web server development.
Figure 2. Performance of Four Different ML Classifiers Using Five Feature Encodings to Distinguish between 6mA Sites and Non-6mA Sites
(A) RF, (B) ERT, (C) XGB, and (D) SVM.
Figure 3. Absolute Differences in Accuracy (ΔACC) Computed between the Accuracy Obtained from 10-fold Cross-Validation and Independent Evaluation for Each Classifier with Respect to Different Encodings
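ΔACC is the absolute gap between two accuracy values, so a small ΔACC suggests a model that transfers well to unseen data. The sketch below uses the ACC values of the four single models from the benchmarking and independent-dataset tables in this record purely as illustrative inputs; it assumes the benchmarking values stand in for the cross-validation accuracy, and it does not reproduce the per-encoding values plotted in the figure.

```python
def delta_acc(acc_cross_validation, acc_independent):
    """Absolute difference between cross-validation and independent accuracy."""
    return abs(acc_cross_validation - acc_independent)


# ACC of the single models, copied from the two performance tables.
# Treating the benchmarking values as cross-validation accuracy is an assumption.
acc_benchmark = {"RF": 0.878, "ERT": 0.878, "SVM": 0.875, "XGB": 0.874}
acc_independent = {"RF": 0.858, "ERT": 0.864, "SVM": 0.871, "XGB": 0.860}

for name in acc_benchmark:
    gap = delta_acc(acc_benchmark[name], acc_independent[name])
    print(name, round(gap, 3))
```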
Figure 4. Sequential Forward Search for Discriminating between 6mA Sites and Non-6mA Sites
The x axis corresponds to the feature dimension and the y axis represents the performance in terms of accuracy. The maximum accuracy obtained via 10-fold cross-validation is shown for each feature encoding. (A) RF, (B) ERT, (C) SVM, and (D) XGB.
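The curve in this figure (accuracy against feature dimension, with the maximum marked) can be produced by a simple search loop. The sketch below is a ranked-list variant of sequential forward search: it assumes the features have already been ranked and adds them one at a time, keeping the prefix with the highest accuracy. Whether the authors' protocol works exactly this way is an assumption, and `evaluate` and `toy_evaluate` are hypothetical stand-ins for a 10-fold cross-validation run.

```python
def sequential_forward_search(ranked_features, evaluate):
    """Grow the feature set along a ranked list and keep the best prefix.

    ranked_features: features ordered from most to least important.
    evaluate: hypothetical callable returning an accuracy for a feature list.
    Returns the selected features, their accuracy, and the full curve.
    """
    curve = []
    best_acc, best_size = float("-inf"), 0
    for size in range(1, len(ranked_features) + 1):
        acc = evaluate(ranked_features[:size])
        curve.append((size, acc))  # one (dimension, accuracy) point per step
        if acc > best_acc:
            best_acc, best_size = acc, size
    return ranked_features[:best_size], best_acc, curve


def toy_evaluate(features):
    """Toy scorer: informative features help, the rest add a small penalty."""
    informative = {"f1", "f2", "f3"}
    hits = sum(f in informative for f in features)
    noise = len(features) - hits
    return 0.70 + 0.05 * hits - 0.01 * noise


selected, acc, curve = sequential_forward_search(
    ["f1", "f2", "f3", "f4", "f5"], toy_evaluate
)
print(selected, round(acc, 3))
```

Because ties keep the earlier, smaller prefix, the loop favors the lowest feature dimension that reaches the maximum accuracy.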
Figure 5. Comparison of Original Features and Optimal Features in Terms of Performance and Feature Dimension
(A) Percentage of improvement for each optimal feature-set encoding with respect to four different classifiers. (B) Comparison of the original and optimal feature dimensions for each feature encoding with respect to four different classifiers.
Table 1. Performance Comparison of Different Single Method-Based Models and a Selection of Ensemble Models for Predicting 6mA Sites on the Benchmarking Dataset
| Method | MCC | ACC | SN | SP | AUC |
|---|---|---|---|---|---|
| 1. RF | 0.759 | 0.878 | 0.840 | 0.917 | 0.942 |
| 2. ERT | 0.759 | 0.878 | 0.844 | 0.913 | 0.946 |
| 3. SVM | 0.751 | 0.875 | 0.852 | 0.898 | 0.935 |
| 4. XGB | 0.748 | 0.874 | 0.860 | 0.889 | 0.947 |
| {1, 2} | 0.760 | 0.880 | 0.872 | 0.889 | 0.945 |
| {2, 3} | 0.763 | 0.881 | 0.852 | 0.909 | 0.940 |
| {3, 4} | 0.757 | 0.878 | 0.874 | 0.883 | 0.944 |
| {1, 3} | 0.753 | 0.877 | 0.870 | 0.883 | 0.939 |
| {1, 4} | 0.756 | 0.878 | 0.873 | 0.883 | 0.947 |
| {2, 4} | 0.758 | 0.879 | 0.875 | 0.883 | 0.948 |
| {1, 2, 3} | 0.760 | 0.880 | 0.860 | 0.899 | 0.942 |
| {2, 3, 4} | 0.758 | 0.879 | 0.875 | 0.883 | 0.945 |
| {1, 3, 4} | 0.757 | 0.878 | 0.859 | 0.898 | 0.944 |
| {1, 2, 4} | 0.758 | 0.879 | 0.874 | 0.884 | 0.947 |
| {1, 2, 3, 4} | 0.756 | 0.878 | 0.865 | 0.891 | 0.945 |
| i6mA-Pred | 0.670 | 0.835 | 0.834 | 0.836 | 0.909 |
| iDNA6mA | 0.732 | 0.867 | 0.866 | 0.866 | 0.931 |
The first column lists either a single method-based model or an ensemble model built by combining different single models. For instance, “1. RF” stands for the prediction model developed with RF, while “{1, 2}” denotes the ensemble model built from the single models numbered “1” and “2.” Abbreviations are as follows: MCC, Matthews correlation coefficient; ACC, accuracy; SN, sensitivity; SP, specificity; and AUC, area under the curve.
The optimal model was selected by systematically examining all possible random combinations.
i6mA-Pred and iDNA6mA are the existing methods used for comparison; their metrics are taken from the corresponding references (18, 19).
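For readers who want to recompute the threshold-based columns, the sketch below derives MCC, ACC, SN, and SP from a confusion matrix using their standard definitions. The counts in the example are made up for illustration and are not taken from the study.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes at least one positive and one negative sample, so the SN and SP
    denominators are non-zero.
    """
    sn = tp / (tp + fn)                    # sensitivity: recall on 6mA sites
    sp = tn / (tn + fp)                    # specificity: recall on non-6mA sites
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"MCC": mcc, "ACC": acc, "SN": sn, "SP": sp}


# Illustrative counts only: 100 positives and 100 negatives.
metrics = classification_metrics(tp=85, tn=90, fp=10, fn=15)
for name, value in metrics.items():
    print(name, round(value, 3))
```

MCC uses all four cells of the confusion matrix, which is why it can rank models differently from ACC when SN and SP are unbalanced.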
Table 2. Performance Comparison of Various Single Method-Based Models and a Selected Ensemble Model for Predicting 6mA Sites on the Independent Dataset
| Method | MCC | ACC | SN | SP | AUC |
|---|---|---|---|---|---|
| 1. RF | 0.715 | 0.858 | 0.842 | 0.873 | 0.923 |
| 2. ERT | 0.729 | 0.864 | 0.855 | 0.873 | 0.934 |
| 3. SVM | 0.742 | 0.871 | 0.873 | 0.869 | 0.936 |
| 4. XGB | 0.721 | 0.860 | 0.891 | 0.828 | 0.939 |
| {1, 2} | 0.726 | 0.862 | 0.896 | 0.828 | 0.931 |
| {2, 3} | 0.769 | 0.885 | 0.878 | 0.891 | 0.938 |
| {3, 4} | 0.757 | 0.878 | 0.905 | 0.851 | 0.940 |
| {1, 3} | 0.743 | 0.871 | 0.891 | 0.851 | 0.935 |
| {1, 4} | 0.731 | 0.864 | 0.905 | 0.824 | 0.936 |
| {2, 4} | 0.731 | 0.864 | 0.905 | 0.824 | 0.938 |
| {1, 2, 3} | 0.751 | 0.876 | 0.887 | 0.864 | 0.936 |
| {2, 3, 4} | 0.744 | 0.871 | 0.905 | 0.837 | 0.939 |
| {1, 3, 4} | 0.760 | 0.880 | 0.891 | 0.869 | 0.938 |
| {1, 2, 4} | 0.735 | 0.867 | 0.905 | 0.828 | 0.936 |
| {1, 2, 3, 4} | 0.731 | 0.864 | 0.905 | 0.824 | 0.936 |
The first column represents a single method-based model or an ensemble model, which was built based on combining different single models (see Table 1 legend for more information).
The best performance obtained by the optimal model.
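The ensemble rows in the two performance tables cover every subset of two or more of the four single models, 11 in total. A minimal sketch of that enumeration follows; it assumes an ensemble score is the mean of its members' predicted probabilities, which this record does not confirm, and the probabilities in the example are made up.

```python
from itertools import combinations
from statistics import mean

# Model numbering as used in the first column of the tables.
models = {1: "RF", 2: "ERT", 3: "SVM", 4: "XGB"}


def candidate_ensembles(model_ids):
    """Every combination of two or more models, smallest ensembles first."""
    ids = sorted(model_ids)
    return [
        combo
        for size in range(2, len(ids) + 1)
        for combo in combinations(ids, size)
    ]


def ensemble_probability(probs_by_model, combo):
    """Average the member probabilities (averaging is an assumption)."""
    return mean([probs_by_model[i] for i in combo])


ensembles = candidate_ensembles(models)
print(len(ensembles))  # 11 candidate ensembles

# Made-up probabilities for a single candidate site.
probs = {1: 0.6, 2: 0.9, 3: 0.7, 4: 0.4}
for combo in ensembles:
    names = "+".join(models[i] for i in combo)
    print(names, round(ensemble_probability(probs, combo), 3))
```

Scoring each candidate on the same cross-validation folds and keeping the best one would then mirror the selection step described in the table legends.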
Table 3. Performance of the Proposed Method and Two State-of-the-Art Predictors on the Independent Dataset
| Method | MCC | ACC | SN | SP | AUC | p Value |
|---|---|---|---|---|---|---|
| SDM6A | 0.765 | 0.882 | 0.878 | 0.887 | 0.938 | — |
| i6mA-Pred | 0.647 | 0.824 | 0.819 | 0.828 | NA | < 0.0001 |
| iDNA6mA | 0.702 | 0.851 | 0.873 | 0.828 | NA | < 0.0001 |
The first column represents the method evaluated in this study. Because i6mA-Pred and iDNA6mA do not provide predicted probability values, their AUC values could not be computed and are reported as NA.
A p value < 0.001 was considered to indicate a statistically significant difference between SDM6A and the selected method.
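The NA entries follow from how AUC is defined: it measures how well continuous scores rank positives above negatives. The sketch below computes AUC as a pairwise rank statistic on made-up data. With only hard 0/1 labels as scores, the same computation collapses to a single operating point, (SN + SP) / 2, so it no longer describes a curve.

```python
def auc_from_scores(labels, scores):
    """AUC as the probability that a positive outscores a negative.

    Ties count as half a win. Labels are 1 (6mA) or 0 (non-6mA).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: three positives followed by three negatives.
labels = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
hard_calls = [1, 1, 0, 1, 0, 0]  # the same predictions thresholded at 0.5

auc_prob = auc_from_scores(labels, probabilities)
auc_hard = auc_from_scores(labels, hard_calls)
print(round(auc_prob, 3), round(auc_hard, 3))
```

In this toy case the probability-based value is 8/9, while the hard-call value equals (SN + SP) / 2 = 2/3, which shows the information lost when a predictor returns class labels only.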