Jhabindra Khanal, Hilal Tayara, Quan Zou, Kil To Chong.
Abstract
DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. Previous computational studies of 4mC site identification mostly relied on hand-crafted features. This area of research would therefore benefit from a computational approach that selects features automatically to identify relevant sites. Here we report 4mC-w2vec, a computational method that learns discriminative features automatically in the Rosaceae genomes, in particular Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca), using a distributed feature representation obtained with the word embedding technique 'word2vec'. Although a few bioinformatics tools are currently used to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system passes 4mC and non-4mC sequences through a word embedding process that includes sub-word information of its biological words (k-mers); the resulting embeddings serve as features for a two-layer convolutional neural network (CNN) that classifies whether a sample sequence contains a 4mC site. Our tool outperformed current tools on the same genomic datasets. Additionally, 4mC-w2vec is effective on balanced and imbalanced datasets alike, and the web server is available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/.
Keywords: Convolutional Neural Network; DNA N4-methylcytosine (4mC); Sequence analysis; Web-server; Word embedding
Year: 2021 PMID: 33868598 PMCID: PMC8042287 DOI: 10.1016/j.csbj.2021.03.015
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Summary of training and independent test datasets for F. vesca and R. chinensis.
| Genomes | Positive/Negative | Training datasets | Independent datasets |
|---|---|---|---|
| F. vesca | Positive | 3457 | 864 |
| F. vesca | Negative | 3457 | 864, 4320, 12960 |
| R. chinensis | Positive | 1938 | 483 |
| R. chinensis | Negative | 1938 | 483, 2415, 7245 |
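
The three negative-set sizes listed for each independent dataset match positive-to-negative ratios of 1:1, 1:5, and 1:15, the ratios used in the independent-test comparison. A minimal sketch of that arithmetic follows; the function name is hypothetical and chosen only for illustration.

```python
def negatives_for_ratio(n_positive: int, ratio: int) -> int:
    """Number of negatives needed for a 1:ratio positive-to-negative split."""
    return n_positive * ratio


RATIOS = (1, 5, 15)

# Positive counts of the two independent datasets, taken from the table.
for n_pos in (864, 483):
    sizes = [negatives_for_ratio(n_pos, r) for r in RATIOS]
    # Share of positives in each resulting test set.
    shares = [n_pos / (n_pos + n_neg) for n_neg in sizes]
    print(n_pos, sizes, [round(s, 4) for s in shares])
```

At the 1:15 ratio, positives make up only 1/16 of the test set, which is why a precision-recall summary is informative for these comparisons.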
Fig. 1. A general architecture of the proposed model: (a) word embedding process and (b) one-dimensional CNN model.
Word2vec training parameters.
| Parameters | word2vec model |
|---|---|
| Training Method | CBOW |
| Vector Size | 100 |
| Corpus | Genomes of F. vesca and R. chinensis |
| Minimum Count | 1 |
| Context Words | 3-mer |
| Negative Sampling | 5 |
| Window Size | 5 |
| Number of Epochs | 20 |
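
The parameters above can be illustrated with a short sketch. Two things here are assumptions rather than details confirmed by the source: that sequences are split into overlapping k-mers with a stride of 1, and that the embeddings are trained with the gensim library (the keyword names below are those of gensim 4.x `Word2Vec`). The helper names `kmer_tokens` and `train_word2vec` are hypothetical.

```python
from typing import List


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1 is an assumption)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Table values mapped onto gensim 4.x Word2Vec keyword arguments.
W2V_PARAMS = {
    "sg": 0,             # 0 selects CBOW
    "vector_size": 100,  # embedding dimension
    "min_count": 1,      # keep every k-mer, however rare
    "negative": 5,       # negative samples per positive pair
    "window": 5,         # context window, counted in k-mers
    "epochs": 20,
}


def train_word2vec(sequences, k: int = 3):
    """Train k-mer embeddings; requires gensim to be installed."""
    # Imported lazily so the tokenizer above works without gensim.
    from gensim.models import Word2Vec

    corpus = [kmer_tokens(s, k) for s in sequences]
    return Word2Vec(sentences=corpus, **W2V_PARAMS)


print(kmer_tokens("ACGTAC"))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

Each sequence becomes one "sentence" of 3-mer "words", so a sequence of length n yields n - 2 tokens under the stride-1 assumption.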
The proposed CNN’s architecture.
| Layers | Output shape |
|---|---|
| Input | |
| Conv1D (64,9,1) | |
| Conv1D (64,9,1) | |
| Dropout (0.5) | |
| Dense | |
| Sigmoid | |
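
The output shapes are not given in the table. As a rough aid, the sketch below traces how the sequence axis would shrink through the two convolutions. It rests on several assumptions: that `(64,9,1)` means 64 filters, kernel size 9, and stride 1; that padding is "valid" unless stated otherwise; and that the input is a hypothetical 39 tokens of 100-dimensional embeddings. None of these values are confirmed by the source.

```python
def conv1d_output_length(length: int, kernel_size: int,
                         stride: int = 1, padding: str = "valid") -> int:
    """Output length of a 1D convolution along the sequence axis."""
    if padding == "same":
        return -(-length // stride)  # ceil(length / stride)
    if padding == "valid":
        return (length - kernel_size) // stride + 1
    raise ValueError(f"unknown padding: {padding!r}")


def trace_shapes(n_tokens: int, embedding_dim: int = 100, padding: str = "valid"):
    """List (layer, shape) pairs up to the dropout layer."""
    shapes = [("Input", (n_tokens, embedding_dim))]
    length = n_tokens
    for _ in range(2):  # two Conv1D layers with 64 filters, kernel 9, stride 1
        length = conv1d_output_length(length, kernel_size=9, stride=1, padding=padding)
        shapes.append(("Conv1D (64,9,1)", (length, 64)))
    shapes.append(("Dropout (0.5)", (length, 64)))  # dropout keeps the shape
    return shapes


# Hypothetical input of 39 tokens; the real input length is not stated.
for layer, shape in trace_shapes(39):
    print(f"{layer:18s} {shape}")
```

With "valid" padding each convolution removes kernel_size - 1 = 8 positions, so the choice of padding materially changes the size of the features that reach the dense layer.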
Fig. 2. Demonstration of nucleotide composition preferences between positives (4mC-containing sequences) and negatives (non-4mC-containing sequences) for F. vesca and R. chinensis datasets.
Performance of the CNN using different word2vec models (based on k-mers) and one-hot encoding on the training dataset for both species by a 5-fold cross-validation test.
| Species | Methods | Sn | Sp | ACC | MCC | AUC |
|---|---|---|---|---|---|---|
| F. vesca | k = 1 | 0.7963 | 0.7700 | 0.7832 | 0.5666 | 0.8520 |
| F. vesca | k = 2 | 0.7984 | 0.8295 | 0.8141 | 0.6283 | 0.8141 |
| F. vesca | k = 3 | 0.8417 | | | | |
| F. vesca | k = 4 | 0.8141 | | 0.8374 | 0.6751 | 0.9155 |
| F. vesca | k = 5 | 0.7931 | 0.7582 | 0.7754 | 0.5516 | 0.8505 |
| F. vesca | k = 6 | 0.7984 | 0.7302 | 0.7638 | 0.5296 | 0.8435 |
| F. vesca | onehot | 0.8507 | 0.8244 | 0.8374 | 0.6752 | 0.8920 |
| R. chinensis | k = 1 | 0.6873 | | 0.7793 | 0.5682 | 0.8781 |
| R. chinensis | k = 2 | 0.8144 | 0.8199 | 0.8541 | 0.6335 | 0.8934 |
| R. chinensis | k = 3 | 0.8219 | | | | |
| R. chinensis | k = 4 | 0.8664 | 0.7633 | 0.8141 | 0.6326 | 0.8891 |
| R. chinensis | k = 5 | 0.8220 | 0.7519 | 0.7870 | 0.5755 | 0.8755 |
| R. chinensis | k = 6 | 0.7722 | 0.8066 | 0.7896 | 0.5793 | 0.8604 |
| R. chinensis | onehot | 0.7958 | 0.8371 | 0.8167 | 0.6337 | 0.9110 |
Note: The best performance value for each metric across different methods is highlighted in bold.
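
For reference, Sn (sensitivity), Sp (specificity), ACC (accuracy), and MCC (Matthews correlation coefficient) follow their standard confusion-matrix definitions. The sketch below computes them from counts; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sn, Sp, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity: recall on 4mC sites
    sp = tn / (tn + fp)                    # specificity: recall on non-4mC sites
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Made-up counts for a balanced set of 100 positives and 100 negatives.
for name, value in binary_metrics(tp=80, fn=20, tn=70, fp=30).items():
    print(f"{name}: {value:.4f}")
```

On a balanced dataset ACC is the mean of Sn and Sp, which is a quick consistency check for rows of a table like the one above.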
Fig. 3. Performance comparisons of the word2vec-based model and the one-hot encoding-based model when classified by the CNN using a 5-fold cross-validation test on F. vesca (a) and R. chinensis (b).
The performance of the i4mC-Fuse, DNC4mC-Deep, and i4mC-w2vec on the independent datasets with different ratios.
| Species | Method | Ratio (positive:negative) | Sn | Sp | ACC | MCC | PRauc |
|---|---|---|---|---|---|---|---|
| F. vesca | i4mC-Fuse | 1:1 | 0.8376 | 0.7209 | 0.7793 | 0.5624 | 0.8482 |
| F. vesca | i4mC-Fuse | 1:5 | 0.8530 | 0.7105 | 0.7819 | 0.5695 | 0.8606 |
| F. vesca | i4mC-Fuse | 1:15 | 0.8569 | 0.6434 | 0.7703 | 0.5586 | 0.8517 |
| F. vesca | DNC4mC-Deep | 1:1 | 0.8582 | 0.7390 | 0.7987 | 0.6016 | 0.8438 |
| F. vesca | DNC4mC-Deep | 1:5 | 0.8560 | 0.6950 | 0.7858 | 0.5810 | 0.8694 |
| F. vesca | DNC4mC-Deep | 1:15 | 0.8556 | 0.7183 | 0.7870 | 0.5795 | 0.8723 |
| F. vesca | i4mC-w2vec | 1:1 | 0.8994 | 0.8268 | 0.8632 | 0.7283 | 0.9176 |
| F. vesca | i4mC-w2vec | 1:5 | 0.8814 | 0.8449 | 0.8632 | 0.7269 | 0.9188 |
| F. vesca | i4mC-w2vec | 1:15 | 0.8762 | 0.8062 | 0.8412 | 0.6842 | 0.9021 |
| R. chinensis | i4mC-Fuse | 1:1 | 0.8505 | 0.7312 | 0.7909 | 0.5860 | 0.8646 |
| R. chinensis | i4mC-Fuse | 1:5 | 0.8411 | 0.6718 | 0.7716 | 0.5541 | 0.8526 |
| R. chinensis | i4mC-Fuse | 1:15 | 0.8072 | 0.6149 | 0.7612 | 0.5461 | 0.8507 |
| R. chinensis | DNC4mC-Deep | 1:1 | 0.8637 | 0.7131 | 0.7935 | 0.5946 | 0.8700 |
| R. chinensis | DNC4mC-Deep | 1:5 | 0.8537 | 0.7235 | 0.7987 | 0.6000 | 0.8641 |
| R. chinensis | DNC4mC-Deep | 1:15 | 0.8391 | 0.6511 | 0.7703 | 0.5564 | 0.8594 |
| R. chinensis | i4mC-w2vec | 1:1 | 0.8737 | 0.8242 | 0.8490 | 0.6988 | 0.9099 |
| R. chinensis | i4mC-w2vec | 1:5 | 0.884 | 0.7957 | 0.8400 | 0.6825 | 0.8966 |
| R. chinensis | i4mC-w2vec | 1:15 | 0.8940 | 0.8113 | 0.8477 | 0.6972 | 0.9136 |
Fig. 4. Comparison of the precision-recall curves (PRC) generated by our method and two existing methods at different ratios of the balanced/imbalanced independent test datasets for both species. The PRauc scores and PR curves show that 4mC-w2vec outperforms the existing methods on the F. vesca (a–c) and R. chinensis (d–e) datasets.
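
PRauc denotes the area under a precision-recall curve. One common way to estimate it is average precision, sketched below in plain Python. This is an illustrative estimator only: the source does not state how the authors computed PRauc, the function name is hypothetical, and the sketch assumes the scores contain no ties.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (labels are 0 or 1)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank samples from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


# Made-up labels and scores for illustration.
print(round(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]), 4))  # 0.8333
```

Because precision depends on the number of false positives, this summary is sensitive to added negatives that receive high scores, which makes it a useful complement to Sn and Sp on the 1:5 and 1:15 test sets.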