Zhixun Zhao, Xiaocai Zhang, Fang Chen, Liang Fang, Jinyan Li.
Abstract
BACKGROUND: DNA N4-methylcytosine (4mC) is a critical epigenetic modification and has various roles in the restriction-modification system. Due to the high cost of experimental laboratory detection, computational methods using sequence characteristics and machine learning algorithms have been explored to identify 4mC sites from DNA sequences. However, state-of-the-art methods have limited performance because of the lack of effective sequence features and the ad hoc choice of learning algorithms to cope with this problem. This paper aims to propose a new sequence feature space and a machine learning algorithm with a feature selection scheme to address the problem.
Keywords: DNA N4-methylcytosine; Feature selection; Sequence feature; Site prediction
Year: 2020 PMID: 32917152 PMCID: PMC7488740 DOI: 10.1186/s12864-020-07033-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Framework of proposed model construction
Fig. 2 Sequence feature importance distribution
The independent test performance before and after feature selection (Sn, Sp and ACC: %)
| Datasets | Selection | Sn | Sp | ACC | MCC |
|---|---|---|---|---|---|
| C.elegans | before | 82.69 | 75.00 | 78.85 | 0.58 |
| C.elegans | after | 94.23 | 78.85 | 86.53 | 0.74 |
| D.melanogaster | before | 74.57 | 77.12 | 75.85 | 0.52 |
| D.melanogaster | after | 84.74 | 86.44 | 85.59 | 0.71 |
| A.thaliana | before | 82.57 | 76.51 | 79.54 | 0.59 |
| A.thaliana | after | 80.30 | 83.33 | 81.81 | 0.64 |
| E.coli | before | 92.30 | 69.23 | 80.76 | 0.63 |
| E.coli | after | 88.46 | 88.46 | 88.46 | 0.77 |
| G.subterraneus | before | 83.33 | 75.00 | 79.17 | 0.59 |
| G.subterraneus | after | 91.67 | 81.67 | 86.67 | 0.74 |
| G.pickeringii | before | 81.57 | 78.94 | 80.26 | 0.61 |
| G.pickeringii | after | 86.84 | 89.47 | 88.15 | 0.76 |
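The ACC column can be cross-checked against Sn and Sp. The following is a minimal sketch, assuming each independent test set contains equal numbers of positive and negative samples (an assumption suggested by the balanced benchmark datasets, not stated for the test sets themselves). Under that assumption ACC reduces to the mean of Sn and Sp. The helper name `balanced_acc` is illustrative and does not come from the paper.

```python
def balanced_acc(sn: float, sp: float) -> float:
    """ACC in percent, ASSUMING equal numbers of positive and negative samples."""
    return (sn + sp) / 2.0


# (dataset, selection, Sn, Sp, reported ACC), copied from the table above
rows = [
    ("C.elegans", "before", 82.69, 75.00, 78.85),
    ("C.elegans", "after", 94.23, 78.85, 86.53),
    ("E.coli", "before", 92.30, 69.23, 80.76),
    ("E.coli", "after", 88.46, 88.46, 88.46),
]

for name, selection, sn, sp, reported in rows:
    recomputed = balanced_acc(sn, sp)
    # Small differences are expected because the table values are rounded.
    print(f"{name:10s} {selection:6s} recomputed={recomputed:.2f} "
          f"reported={reported:.2f} diff={abs(recomputed - reported):.3f}")
```

If the recomputed and reported values differed by more than rounding error, that would suggest the test set is not balanced or that a table entry was mistranscribed.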
Fig. 3 The ROC curves before and after feature selection
Fig. 4 The confidence of predicted label in case studies
Independent test results on benchmark datasets (Sn, Sp and ACC: %)
| Methods | Datasets | Sn | Sp | ACC | MCC |
|---|---|---|---|---|---|
| iDNA4mC | C.elegans | 80.77 | 73.08 | 76.92 | 0.54 |
| iDNA4mC | D.melanogaster | 74.58 | 77.97 | 76.27 | 0.53 |
| iDNA4mC | A.thaliana | 80.3 | 77.27 | 78.79 | 0.58 |
| iDNA4mC | E.coli | 96.15 | 69.23 | 82.69 | 0.68 |
| iDNA4mC | G.subterraneus | 85.00 | 76.67 | 80.83 | 0.62 |
| iDNA4mC | G.pickeringii | 81.58 | 78.95 | 80.26 | 0.61 |
| 4mCPred | C.elegans | 85.58 | 78.85 | 82.21 | 0.65 |
| 4mCPred | D.melanogaster | 83.90 | 81.36 | 82.63 | 0.65 |
| 4mCPred | A.thaliana | 76.52 | 76.52 | 76.52 | 0.53 |
| 4mCPred | E.coli | 84.62 | 80.77 | 82.69 | 0.65 |
| 4mCPred | G.subterraneus | 91.67 | 75.00 | 83.33 | 0.68 |
| 4mCPred | G.pickeringii | 86.84 | 68.42 | 77.63 | 0.56 |
| this study | C.elegans | 94.23 | 78.85 | 86.53 | 0.74 |
| this study | D.melanogaster | 84.74 | 86.44 | 85.59 | 0.71 |
| this study | A.thaliana | 80.30 | 83.33 | 81.81 | 0.64 |
| this study | E.coli | 88.46 | 88.46 | 88.46 | 0.77 |
| this study | G.subterraneus | 91.67 | 81.67 | 86.67 | 0.74 |
| this study | G.pickeringii | 86.84 | 89.47 | 88.15 | 0.76 |
Cross-validation results on benchmark datasets (Sn, Sp and ACC: %; TP: true positive, FN: false negative, FP: false positive, TN: true negative)
| Datasets | Methods | Sn | Sp | ACC | MCC | TP | FN | FP | TN |
|---|---|---|---|---|---|---|---|---|---|
| C.elegans | iDNA4mC | 79.7 | 77.5 | 78.6 | 0.572 | 1328 | 316 | 349 | 1205 |
| C.elegans | 4mCPred | 82.5 | 82.6 | 82.6 | 0.652 | 1282 | 272 | 270 | 1284 |
| C.elegans | 4mCPred_SVM | 82.4 | 80.7 | 81.5 | 0.631 | 1280 | 274 | 300 | 1254 |
| C.elegans | this study | 84.9 | 80.4 | 82.6 | 0.653 | 1319 | 235 | 305 | 1249 |
| D.melanogaster | iDNA4mC | 83.3 | 79.1 | 81.2 | 0.625 | 1474 | 295 | 369 | 1400 |
| D.melanogaster | 4mCPred | 82.4 | 82.1 | 82.2 | 0.646 | 1458 | 311 | 317 | 1452 |
| D.melanogaster | 4mCPred_SVM | 83.8 | 82.2 | 83.0 | 0.661 | 1483 | 286 | 314 | 1455 |
| D.melanogaster | this study | 85.4 | 83.2 | 84.3 | 0.686 | 1510 | 259 | 297 | 1472 |
| A.thaliana | iDNA4mC | 75.7 | 76.2 | 76.0 | 0.519 | 1498 | 480 | 471 | 1507 |
| A.thaliana | 4mCPred | 75.5 | 78.0 | 76.8 | 0.536 | 1494 | 484 | 435 | 1543 |
| A.thaliana | 4mCPred_SVM | 77.8 | 79.6 | 78.7 | 0.573 | 1538 | 440 | 404 | 1574 |
| A.thaliana | this study | 78.3 | 80.5 | 79.4 | 0.589 | 1549 | 429 | 385 | 1593 |
| E.coli | iDNA4mC | 82.0 | 77.8 | 79.9 | 0.598 | 318 | 70 | 86 | 302 |
| E.coli | 4mCPred | 81.9 | 83.2 | 82.6 | 0.655 | 318 | 70 | 65 | 302 |
| E.coli | 4mCPred_SVM | 85.8 | 80.7 | 83.3 | 0.666 | 333 | 51 | 67 | 321 |
| E.coli | this study | 86.1 | 82.5 | 84.3 | 0.686 | 334 | 54 | 68 | 320 |
| G.subterraneus | iDNA4mC | 82.2 | 80.8 | 81.5 | 0.630 | 745 | 161 | 174 | 732 |
| G.subterraneus | 4mCPred | 81.8 | 83.7 | 82.8 | 0.662 | 742 | 164 | 148 | 758 |
| G.subterraneus | 4mCPred_SVM | 84.0 | 83.4 | 83.7 | 0.674 | 760 | 145 | 150 | 755 |
| G.subterraneus | this study | 83.6 | 85.7 | 84.7 | 0.694 | 757 | 148 | 129 | 776 |
| G.pickeringii | iDNA4mC | 82.4 | 83.8 | 83.1 | 0.663 | 469 | 100 | 92 | 477 |
| G.pickeringii | 4mCPred | 85.0 | 81.0 | 83.0 | 0.668 | 484 | 85 | 108 | 461 |
| G.pickeringii | 4mCPred_SVM | 86.3 | 85.8 | 86.0 | 0.721 | 491 | 78 | 81 | 488 |
| G.pickeringii | this study | 86.3 | 89.1 | 87.7 | 0.754 | 491 | 78 | 62 | 507 |
4mC site identification in case studies (TP: true positive; FN: false negative)
| Case | Methods | Total | TP | FN |
|---|---|---|---|---|
| dlk-1 | iDNA4mC | 26 | 19 | 7 |
| dlk-1 | 4mCPred | 26 | 25 | 1 |
| dlk-1 | 4mCPred_SVM | 26 | 20 | 6 |
| dlk-1 | This study | 26 | 24 | 2 |
| DSCAM | iDNA4mC | 137 | 70 | 67 |
| DSCAM | 4mCPred | 137 | 121 | 16 |
| DSCAM | 4mCPred_SVM | 137 | 122 | 15 |
| DSCAM | This study | 137 | 126 | 11 |
Summary of six benchmark datasets
| Species | Positive Sample | Negative Sample | Total |
|---|---|---|---|
| C.elegans | 1554 | 1554 | 3108 |
| D.melanogaster | 1769 | 1769 | 3538 |
| A.thaliana | 1978 | 1978 | 3956 |
| E.coli | 388 | 388 | 776 |
| G.subterraneus | 906 | 906 | 1812 |
| G.pickeringii | 569 | 569 | 1138 |
Fig. 5 Sequence logos for DNA samples in the benchmark datasets