Md Mehedi Hasan, Balachandran Manavalan, Watshara Shoombuatong, Mst Shamima Khatun, Hiroyuki Kurata.
Abstract
N4-methylcytosine (4mC) is one of the most important DNA modifications and is involved in regulating cell differentiation and gene expression. Accurate identification of 4mC sites is necessary to understand various biological functions. In this work, we developed a new computational predictor called i4mC-Mouse to identify 4mC sites in the mouse genome. Six encoding schemes that cover different characteristics of DNA sequence information were explored: k-space nucleotide composition (KSNC), k-mer nucleotide composition (Kmer), mono nucleotide binary encoding (MBE), dinucleotide binary encoding (DBE), electron-ion interaction pseudo potentials (EIIP), and dinucleotide physicochemical composition (DPC). Subsequently, we built six random forest (RF)-based encoding models and linearly combined their probability scores to construct the final predictor. Of the six RF-based models, the Kmer, KSNC, MBE, and EIIP models were sufficient; they received weights of 10%, 45%, 25%, and 20% in the combined score, respectively. On the independent test, i4mC-Mouse predicted the 4mC sites with an accuracy of 0.816 and an MCC of 0.633, which were approximately 2.5% and 5% higher than those of the existing method (4mCpred-EL). For experimental biologists, a freely available web application was implemented at http://kurata14.bio.kyutech.ac.jp/i4mC-Mouse/.
Keywords: Machine learning; Mouse genome; Sequence analysis; Sequence encoding
Year: 2020 PMID: 32322372 PMCID: PMC7168350 DOI: 10.1016/j.csbj.2020.04.001
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. A computational framework of i4mC-Mouse. It includes three steps: (i) dataset construction; (ii) selection of six different encoding schemes that convert DNA sequences into numerical feature vectors; and (iii) model construction and evaluation using a cross-validation (CV) test. A web server was then built for the final prediction model (i4mC-Mouse).
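To make step (ii) concrete, the following is a minimal Python sketch of three of the named encodings (Kmer, MBE, and EIIP). It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of k, the A/C/G/T column order, and the per-position use of EIIP are all hypothetical. The EIIP values are the commonly tabulated ones for DNA nucleotides.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # column order is an assumption made for this sketch

# Commonly tabulated electron-ion interaction pseudopotential values.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def kmer_composition(seq, k=2):
    """Return the frequency of every possible k-mer, in lexicographic order."""
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    for i in range(total):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
    return [counts[km] / total for km in kmers]


def mono_binary_encoding(seq):
    """One-hot encode each nucleotide as four 0/1 values."""
    vector = []
    for base in seq:
        vector.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vector


def eiip_encoding(seq):
    """Replace each nucleotide with its EIIP value (per-position variant)."""
    return [EIIP[base] for base in seq]
```

Each function maps a DNA string to a fixed-length numerical vector for a fixed window size, which is the form a random forest classifier expects as input.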
Fig. 2. Sequence logo representation of 4mC samples. The 20 upstream and 20 downstream DNA residues surrounding the mouse 4mC site were analyzed.
Fig. 3. Performance comparisons of the single encoding-based models and i4mC-Mouse. The ROC curves were evaluated on the training dataset by a 10-fold CV test (A) and on the independent dataset (B).
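For readers unfamiliar with how a ROC curve is summarized as a single AUC value, the sketch below computes AUC as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are hypothetical and are not taken from the study's data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise positive/negative comparisons."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): three of the four pairs are ordered correctly.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.3, 0.6, 0.1])
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; a rank-based formulation gives the same value more efficiently on large datasets.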
Prediction performances of the i4mC-Mouse model and the single encoding-based RF models.
| Methods | MCC | Ac (%) | Sn (%) | Sp (%) | AUC | P-value |
|---|---|---|---|---|---|---|
| Kmer | 0.566 | 74.81 | 59.53 | 90.10 | 0.869 | 0.011 |
| KSNC | 0.602 | 76.90 | 63.42 | 90.30 | 0.882 | 0.063 |
| MBE | 0.486 | 71.20 | 53.81 | 88.61 | 0.851 | 0.006 |
| DBE | 0.432 | 69.13 | 48.11 | 90.10 | 0.814 | 0.001 |
| EIIP | 0.473 | 70.80 | 52.31 | 89.21 | 0.840 | 0.001 |
| DPC | 0.428 | 69.21 | 49.91 | 88.52 | 0.822 | 0.001 |
| i4mC-Mouse | 0.651 | 79.30 | 68.31 | 90.20 | 0.904 | – |
* i4mC-Mouse is the linear combination of the RF scores for the Kmer, KSNC, MBE, DBE, EIIP, and DPC encodings, with weights of 0.10, 0.45, 0.25, 0.00, 0.20, and 0.00, respectively.
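The weighted combination in the table footnote can be sketched as follows. The weights are the ones listed in the footnote; the function name and the example per-encoding RF scores are hypothetical, and any decision threshold applied to the combined score is not given in this record.

```python
# Weights from the table footnote; DBE and DPC receive zero weight.
WEIGHTS = {
    "Kmer": 0.10,
    "KSNC": 0.45,
    "MBE": 0.25,
    "DBE": 0.00,
    "EIIP": 0.20,
    "DPC": 0.00,
}


def combine_scores(rf_scores, weights=WEIGHTS):
    """Linearly combine per-encoding RF probability scores into one score."""
    return sum(weights[name] * rf_scores[name] for name in weights)


# Hypothetical RF probability scores for one candidate site.
example_scores = {
    "Kmer": 0.8, "KSNC": 0.6, "MBE": 0.4,
    "DBE": 0.9, "EIIP": 0.5, "DPC": 0.1,
}
combined = combine_scores(example_scores)  # 0.08 + 0.27 + 0.10 + 0.10
```

Because the weights sum to 1, the combined value stays in the [0, 1] range whenever the individual RF scores do, so it can still be read as a probability-like score.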
Fig. 4. Effect of different ML algorithms on the AUC values of the six single encoding-based models and i4mC-Mouse. The performances were evaluated on the training datasets by a 10-fold CV test.
Comparison between the i4mC-Mouse and 4mCpred-EL.
| Method | MCC | Ac (%) | Sn (%) | Sp (%) | AUC |
|---|---|---|---|---|---|
| 4mCpred-EL | 0.584 | 79.10 | 75.72 | 82.51 | 0.881 |
| i4mC-Mouse | 0.633 | 81.61 | 80.71 | 82.52 | 0.920 |
The performances were evaluated on the independent dataset.
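The metrics in the comparison table are related through the confusion matrix. The sketch below recomputes accuracy (Ac) and MCC from sensitivity (Sn) and specificity (Sp) under the assumption that the independent dataset contains equal numbers of positive and negative samples; that class balance is not stated in this record, and the function name is hypothetical. Under that assumption, the recomputed values come out close to the tabulated ones.

```python
from math import sqrt


def metrics_from_rates(sn, sp, n_pos=1.0, n_neg=1.0):
    """Return (accuracy, MCC) from sensitivity, specificity and class sizes.

    n_pos == n_neg (a balanced dataset) is an assumption of this sketch.
    """
    tp = sn * n_pos
    fn = n_pos - tp
    tn = sp * n_neg
    fp = n_neg - tn
    accuracy = (tp + tn) / (n_pos + n_neg)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Sn and Sp taken from the comparison table, expressed as fractions.
acc_new, mcc_new = metrics_from_rates(0.8071, 0.8252)  # i4mC-Mouse
acc_old, mcc_old = metrics_from_rates(0.7572, 0.8251)  # 4mCpred-EL
```

With balanced classes, accuracy reduces to the mean of Sn and Sp, which is why the two methods differ mainly through sensitivity: their specificities are nearly identical.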