Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation.

Jhabindra Khanal¹, Hilal Tayara², Quan Zou³, Kil To Chong^1,4.

Abstract

DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly relied on hand-crafted features. This area of research would therefore benefit from a computational approach that selects features automatically to identify relevant sites. We here report 4mC-w2vec, a computational method that learns discriminative features automatically in the Rosaceae genomes, specifically Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca), based on distributed feature representation through the word embedding technique 'word2vec'. While a few bioinformatics tools are currently employed to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system processes 4mC and non-4mC sites through a word embedding step that captures sub-word information of the biological words through k-mers; the resulting vectors serve as features that are fed into a two-layer convolutional neural network (CNN) to classify whether the sample sequences contain 4mC or non-4mC sites. Our tool demonstrated performance superior to current tools on the same genomic datasets. Additionally, 4mC-w2vec is effective for balanced and imbalanced class datasets alike, and the online web-server is currently available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/.

Keywords: Convolutional Neural Network; DNA N4-methylcytosine (4mC); Sequence analysis; Web-server; Word embedding

Year: 2021 PMID: 33868598 PMCID: PMC8042287 DOI: 10.1016/j.csbj.2021.03.015

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

Epigenetics refers to heritable changes in gene function that are not related to modifications of the DNA sequence itself [1]. DNA methylation is one of the most widely known epigenetic marks, as it plays a vital role in various critical biological processes, including changes in chromatin structure, DNA stability, gene-expression control, DNA conformation, X-chromosome inactivation, gene regulation, cellular differentiation, and cancer progression [2], [3], [4], [5]. One of the most widespread DNA methylation modifications is N4-methylcytosine (4mC), which was first described in 1983 [6]. In 4mC, the fourth position of the cytosine pyrimidine ring is methylated; the modification occurs in both eukaryotes and prokaryotes, though it is more commonly found and studied in the latter. In prokaryotes, 4mC is part of a restriction-modification (R-M) system that defends against activities of foreign DNA, including its repair, expression, and replication [7], [8], [9], [10], [11]. 4mC also plays a supplementary role in, among other things, genome stabilization, recombination, and evolution [12], [13], [14]. The biological roles of 4mC in eukaryotes are less understood, in part because the small amount of 4mC in eukaryotic genomes prevents its detection through anything other than high-sensitivity techniques.

To identify 4mC sites experimentally, single-molecule real-time (SMRT) sequencing, mass spectrometry, and methylation-precise PCR have all been used [15], [16], [17], [18]. These methods, however, are time-consuming and labor-intensive. Analysis of the ‘big data’ associated with the Rosaceae genome with proper computational tools may be a more efficient means of accurately identifying 4mC sites. Several in silico methods have been proposed to identify 4mC sites for some species (e.g. E. coli, G. subterraneus, A. thaliana, D. melanogaster, C. elegans, G. pickeringii, and mice) using the recently constructed database MethSMRT [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32]. To the best of our knowledge, only two computational methods are currently available to identify 4mC sites in the Rosaceae genome: i4mC-Rose [33] and DNC4mC-Deep [34]. i4mC-Rose uses a random forest classifier with multiple encoding schemes, while DNC4mC-Deep uses a deep learning approach with six encoding techniques. Although these methods produced acceptable results, there is still much room for improvement, especially given that the adopted datasets may not have been of sufficient quality to capture the 4mC motifs, or the feature selection methods employed may not have been suitable to distinguish between the sequence information of the positive and negative classes. Moreover, previous methods relied on domain knowledge to hand-design the input features. Our method, in contrast, automatically captures high-level input features by word embedding, allowing for a novel and highly accurate computational tool.

In this paper, a sequence-based DNA 4mC site predictor was developed. Our central idea was to transform the DNA sequences into vectors by word embedding and then process these with a double-layer one-dimensional CNN for the final classification. The word embedding technique used here was introduced by Google in 2013 [35] to assist with natural language processing (NLP), and it later found success in a number of biological applications [36], [37], [38], [39], [40], [41], [42], [43]. Deep learning of the sort employed in our second step has achieved notable results in a number of areas, including speech recognition [44], image recognition [45], [46], NLP [47], and genome-wide prediction [48], [49], [50], [51], [52], [53]. In our study, integrating the techniques of word embedding and deep learning gave outstanding results for both balanced and imbalanced class datasets, and we suggest that the proposed method is promising for genome-wide prediction.

Materials and methods

Datasets construction

It was necessary to construct a reliable dataset to develop our sequence-based identifier. We independently constructed a complete set of training and independent datasets. 4mC-containing sequences (the positive sequences) were obtained from the MDR database [54], http://mdr.xieslab.org/. According to previous research, the best prediction performance was obtained with a sequence length of 41 nt [22], [29]. Therefore, the length of the DNA sequences was set to 41 nt, with ‘C’ at the center. Previous researchers [33], [34] applied a modification QV (modQV) score of 20 to generate a positive dataset, but as W. Chen et al. have pointed out, a modQV score of 30 or more is the default or the best threshold for labelling the position of a cytosine as modified [21]. In the interest of developing a more reliable model, we applied a QV threshold of 30 to construct our positive dataset and excluded sequences with QV values below 30. To reduce sequence similarity, CD-HIT [55] was used with a cutoff threshold of 65%. As a result of these procedures, we obtained 4321 positive sequences in the F. vesca genome and 2421 positive sequences in the R. chinensis genome. From these datasets, approximately 80% of the sequences (3457 for F. vesca and 1938 for R. chinensis) were selected as training sets, with the remaining sequences (864 for F. vesca and 483 for R. chinensis) used as independent datasets.

The negative sequences (non-4mC-site-containing sequences) were obtained from the same genome files, at positions where no 4mC site (‘C’ at the center) was detected by the SMRT sequencing technique. In this way, a large number of negative sequences with ‘C’ at the center were formed for each species. For model training, the numbers of positive and negative sequences were balanced. To test the efficiency of our proposed model, we constructed the independent datasets with different ratios of positive to negative samples. For F. vesca these were 1:1 [864 positive and 864 negative sequences], 1:5 [864 positive and 4320 negative sequences], and 1:15 [864 positive and 12960 negative sequences]. For R. chinensis these were 1:1 [483 positive and 483 negative sequences], 1:5 [483 positive and 2415 negative sequences], and 1:15 [483 positive and 7245 negative sequences]. Owing to the limited number of independent positive sequences, the same positive sequences were used for all ratio groups (i.e. 864 for F. vesca and 483 for R. chinensis). The negative sequences did not overlap across ratio groups. The training and independent datasets for both species are summarized in Table 1.
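The construction of 41-nt windows with ‘C’ at the center can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the function name `extract_centered_windows` and the toy sequence are hypothetical, and neither the modQV filtering nor the CD-HIT step is included.

```python
def extract_centered_windows(sequence, center_base="C", length=41):
    """Return (position, window) pairs for every window of `length` nt
    whose middle nucleotide equals `center_base`.

    Positions are 0-based indices of the central base in `sequence`.
    """
    if length % 2 == 0:
        raise ValueError("length must be odd so that a center exists")
    flank = length // 2  # 20 nt on each side for a 41-nt window
    sequence = sequence.upper()
    windows = []
    for i in range(flank, len(sequence) - flank):
        if sequence[i] == center_base:
            windows.append((i, sequence[i - flank:i + flank + 1]))
    return windows


# Toy example: a single 'C' with 20 nt of flanking sequence on each side.
toy = "A" * 20 + "C" + "G" * 20
windows = extract_centered_windows(toy)
```

Cytosines closer than 20 nt to either end of the input are skipped here because a full window cannot be built around them; how the original study treated such positions is not stated.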

Table 1

Summary of training and independent test datasets for F. vesca and R. chinensis.

Genomes	Positive/Negative	Training datasets	Independent datasets
F. vesca	Positive	3457	864
	Negative	3457	864, 4320, 12960
R. chinensis	Positive	1938	483
	Negative	1938	483, 2415, 7245

We elected to construct such imbalanced class datasets for testing because real-world datasets commonly have strongly imbalanced distributions. Accordingly, we aimed to assist researchers in testing a classifier on imbalanced datasets. To the best of our knowledge, 4mC-w2vec is the first tool proposed in this area (4mC site prediction) that deals with imbalanced class datasets.
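The ratio groups described in this section (one fixed positive set paired with negative sets that do not overlap across groups) can be illustrated as follows. This is a sketch under stated assumptions: `build_ratio_groups`, the random shuffling, the seed, and the toy identifiers are all hypothetical, since the sampling procedure actually used is not described.

```python
import random


def build_ratio_groups(positives, negative_pool, ratios=(1, 5, 15), seed=0):
    """Pair one fixed positive set with disjoint negative sets.

    For each ratio r, r * len(positives) negatives are drawn from
    `negative_pool` without reuse across ratio groups.
    """
    n_pos = len(positives)
    needed = sum(r * n_pos for r in ratios)
    if needed > len(negative_pool):
        raise ValueError("negative pool is too small for the requested ratios")
    shuffled = list(negative_pool)
    random.Random(seed).shuffle(shuffled)
    groups, start = {}, 0
    for r in ratios:
        size = r * n_pos
        groups[f"1:{r}"] = (list(positives), shuffled[start:start + size])
        start += size
    return groups


# Toy identifiers stand in for sequences; 864 matches the F. vesca test set.
positives = [f"pos{i}" for i in range(864)]
negative_pool = [f"neg{i}" for i in range(20000)]
groups = build_ratio_groups(positives, negative_pool)
sizes = {name: len(neg) for name, (_, neg) in groups.items()}
```

With 864 positives the three groups contain 864, 4320, and 12960 negatives, matching the F. vesca counts in Table 1.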

Methodology

We present a novel method (4mC-w2vec) for identifying 4mC sites in the Rosaceae genome. Our method consists of two major steps. The first step is the discriminative feature generation (representation) stage, in which each DNA sequence is split into words using 3-mers, after which a word-embedding method is applied to map each word to its corresponding feature representation. In the second step, a deep learning model is used to classify 4mC and non-4mC sites based on the features generated in the first stage. A detailed explanation is presented in the following sections, and the general architecture is illustrated in Fig. 1.

Fig. 1

A general architecture of the proposed model: (a) word embedding process and (b) one-dimensional CNN model.

Distributed feature representation

Many real-world biological data applications now involve datasets that are strongly imbalanced, complex, and noisy. We therefore decided to apply a word embedding technique commonly known as ‘word2vec’ [35]. This technique generates an optimal set of feature vectors based on the distributional hypothesis [56]. Word2vec is a two-layer neural network that processes text by vectorizing words, as depicted in Fig. 1 (a). It receives a text corpus as input, and its output is a set of feature vectors that represent the words in that corpus. This technique decreases computational complexity and reduces noise, ultimately leading to improved performance in the resultant computational model. Additionally, many biological codes (such as the genetic code) can be represented as a language [57], [58], [59], with the resulting insights being applied towards solving a variety of biological problems [58], [60], [61]. Accordingly, we adopted the word2vec method to find interpretable representations for each 4mC site.

Training on a large corpus allows word2vec to discover the semantic relations between words. For our research, we generated the corpus by processing the genomes of F. vesca (wild strawberry, NC_020491.1) and R. chinensis (Chinese rose, NC_037093.1) using NCBI genomic data, available at https://www.ncbi.nlm.nih.gov. The first step of training word2vec is building a corpus vocabulary. The word2vec model can be trained with either the Continuous Bag-Of-Words (CBOW) or the Skip-gram method. In the Skip-gram model, the current word w(t) is the input and is used to predict the surrounding window of context words. In contrast, the CBOW method attempts to predict the target word from its neighboring (context) words. As input to the CBOW model, a window size of five was used. CBOW and Skip-gram perform similarly, although Skip-gram is more useful and gives a better outcome for infrequent words [62]. In our study, we are concerned with frequent words, and therefore adopted CBOW for word2vec training.

To train CBOW, the genome assemblies were divided into sentences of 200 nt. Next, each sentence was divided into overlapping 3-mers to form words (such as AAT, CCT, GCN, and CCC). At this point, each sequence consists of a chain of contiguous nucleotide words. These words were fed into the two-layer word2vec model, as depicted in Fig. 1 (a). As a result, each word had its own 100-dimensional (100-D) vector representation, and each sequence of length L was represented by an array of shape (L − 2) × 100, i.e. 39 × 100 for the 41-nt sequences used here (Table 3). For example, the words ‘AAT’ and ‘CCT’ were each represented by their own 100-D vector. The parameters used to train word2vec are listed in Table 2. Most of the parameters were left at their defaults. According to previous research, the best performance was obtained with 100-D vectors [36], [43], [50]; therefore, the dimension of the word vectors was set to 100. We included all words with a frequency of at least 1 (minimum count = 1). In addition to 3-mers, we tested other overlapping k-mers as words: k = 1 (A), k = 2 (AT), k = 4 (ATCG), k = 5 (ATCGA), and k = 6 (ATCGAT). Negative sampling was set to 5 to draw ‘noise words’. The window size, i.e. the maximum distance between the current and predicted word within a sentence, was set to 5. The number of epochs (iterations) over the corpus was set to 20. Word2vec was trained independently for each species using the Python library gensim [63].
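The corpus preparation described here (200-nt sentences, then overlapping 3-mer words) can be sketched as below. The helper names and the toy genome are hypothetical, and cutting the genome into non-overlapping sentences is an assumption, as the text does not say whether sentences overlap.

```python
def split_into_sentences(genome, sentence_length=200):
    """Cut a genome string into consecutive, non-overlapping sentences."""
    return [genome[i:i + sentence_length]
            for i in range(0, len(genome), sentence_length)]


def kmer_words(sequence, k=3):
    """Split a sequence into overlapping k-mers (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


toy_genome = "ACGT" * 125                     # 500 nt, purely illustrative
sentences = split_into_sentences(toy_genome)  # lengths 200, 200, 100
corpus = [kmer_words(s) for s in sentences]   # one token list per sentence

sample = "A" * 20 + "C" + "G" * 20            # a 41-nt sequence
words = kmer_words(sample)                    # 41 - 3 + 1 = 39 words
```

A 41-nt sequence therefore yields 39 words, which is where the 39 rows of the CNN input come from once each word is replaced by its 100-D vector.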

Table 2

Word2vec training parameters.

Parameters	word2vec model
Training Method	CBOW
Vector Size	100
Corpus	Genomes of F. vesca and R. chinensis
Minimum Count	1
Context Words	3-mer
Negative Sampling	5
Window Size	5
Number of Epochs	20

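For readers who want to reproduce a comparable setup, the parameters in Table 2 map onto keyword arguments of gensim's `Word2Vec` class roughly as follows. This is a hedged sketch that assumes the gensim 4.x argument names; the authors' actual training script is not shown in the paper, and `train_word2vec` and `embed_sequence` are hypothetical helper names.

```python
# Keyword arguments follow the gensim 4.x `Word2Vec` signature.
WORD2VEC_PARAMS = {
    "sg": 0,             # 0 selects CBOW (1 would select Skip-gram)
    "vector_size": 100,  # dimension of each word vector
    "min_count": 1,      # keep every word seen at least once
    "negative": 5,       # number of 'noise words' per positive example
    "window": 5,         # max distance between current and predicted word
    "epochs": 20,        # passes over the corpus
}


def train_word2vec(corpus, params=WORD2VEC_PARAMS):
    """Train a word2vec model on `corpus`, a list of k-mer token lists."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim
    return Word2Vec(sentences=corpus, **params)


def embed_sequence(model, sequence, k=3):
    """Map a sequence to a list of word vectors, one per overlapping k-mer.

    Raises KeyError if a k-mer is missing from the model vocabulary.
    """
    words = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return [model.wv[w] for w in words]
```

One model per species would be trained by calling `train_word2vec` on the tokenized corpus of that genome.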

CNN model

As shown in Fig. 1 (b), we used a CNN model (a deep feed-forward neural network) to learn the features generated from word2vec. In a CNN, hyper-parameters determine the layer architecture in the training step, and this affects model accuracy and learning time. Therefore, a grid search strategy was used for hyper-parameter optimization, including the number of filters, the kernel sizes (size of filters), the dropout rates, the number of convolutional layers, and the activation functions. After applying the grid search, the proposed CNN model comprised two one-dimensional convolution layers, each with 64 filters, a kernel size of 9, and a stride of 1. In each convolution layer, a rectified linear unit (ReLU) was used as the activation function. To mitigate over-fitting, the first convolution layer was followed by a dropout layer with a rate of 0.7. For final classification, a fully connected layer with one node followed by a sigmoid function was used. The configuration of the CNN model is presented in Table 3. To train the CNN model on the training datasets, the learning rate was set to 0.0007 and the batch size to 128, with an early stopping strategy based on the validation loss. RMSprop was used as the optimizer [64] and binary cross-entropy as the loss function [65]. The Keras framework, an open-source Python library (https://keras.io/), was used to build 4mC-w2vec. The model can learn from imbalanced class datasets when the ‘class weight’ is set during the CNN training phase; we therefore computed the ‘class weight’ programmatically using Scikit-learn [66].
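The ‘class weight’ idea can be illustrated without any external library. The sketch below reimplements the "balanced" heuristic, n_samples / (n_classes × class count), which is the formula scikit-learn documents for its balanced class-weight mode; whether the authors used exactly this mode is an assumption, and `balanced_class_weights` is a hypothetical name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Toy labels with a 1:5 positive-to-negative ratio (illustrative sizes).
labels = [1] * 864 + [0] * 4320
weights = balanced_class_weights(labels)
```

In this toy case the minority (positive) class receives a weight of 3.0 and the majority class 0.6, so errors on positives count five times as much in the loss. In Keras, such a dictionary can be passed to `model.fit` through its `class_weight` argument.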

Table 3

The proposed CNN’s architecture.

Layers	Output shape
Input	[39,100]
Conv1D (64,9,1)	[39,64]
Conv1D (64,9,1)	[39,64]
Dropout (0.5)	[1248]
Dense	[1]
Sigmoid	[1]

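A Keras model along the lines of the description above might look like the following. Several details are assumptions rather than the authors' code: "same" padding is inferred from the output length staying at 39 in Table 3, the `Flatten` layer is added so that the dense layer receives a vector, and the dropout rate of 0.7 follows the prose (Table 3 lists 0.5 and a flattened size of 1248, which this sketch does not attempt to reproduce). The helper `conv1d_output_length` only illustrates the shape arithmetic.

```python
def conv1d_output_length(n_steps, kernel_size, stride=1, padding="same"):
    """Output length of a 1-D convolution along the sequence axis."""
    if padding == "same":
        return -(-n_steps // stride)              # ceil(n_steps / stride)
    return (n_steps - kernel_size) // stride + 1  # 'valid' padding


def build_cnn(n_words=39, embedding_dim=100):
    """Two Conv1D layers, dropout, and a single sigmoid output unit."""
    from tensorflow import keras  # imported lazily; requires TensorFlow
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(n_words, embedding_dim)),
        layers.Conv1D(64, 9, strides=1, padding="same", activation="relu"),
        layers.Dropout(0.7),  # rate from the text; Table 3 lists 0.5
        layers.Conv1D(64, 9, strides=1, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=keras.optimizers.RMSprop(learning_rate=0.0007),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

With "valid" padding a kernel of 9 would shorten the 39-step input to 31 steps, which is why "same" padding is the more plausible reading of the [39,64] shapes in Table 3.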

Evaluation parameters

Various statistical metrics, including sensitivity (Sn), specificity (Sp), accuracy (ACC), and the Matthews correlation coefficient (MCC), were used to evaluate the performance of the models [67], [68], [69]:

$$\mathrm{Sn} = \frac{TP}{TP + FN} \qquad (2)$$

$$\mathrm{Sp} = \frac{TN}{TN + FP} \qquad (3)$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (5)$$

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively. We also included receiver operating characteristic (ROC) curves to evaluate the proposed method. Overall performance quality was represented by the area under the ROC curve (auROC) [70]. When evaluating binary classifiers on imbalanced class datasets, the precision-recall curve is more helpful than the ROC curve (as pointed out by [71]).
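The four threshold-based metrics follow directly from the confusion-matrix counts. The sketch below uses the standard definitions of Sn, Sp, ACC, and MCC; the function name and the counts in the example are hypothetical and chosen only to illustrate the arithmetic.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return Sn, Sp, ACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts, not taken from the paper.
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
```

For these counts Sn is 0.90, Sp is 0.80, ACC is 0.85, and MCC is about 0.70. Returning 0.0 when the MCC denominator is zero is a common convention, adopted here as an assumption.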

Results and discussion

Analysis of nucleotide composition preference

To demonstrate the nucleotide composition preferences between positives (4mC-containing sequences) and negatives (non-4mC-containing sequences), the Two Sample Logo tool was used [72]. The height of each base is scaled according to its statistical significance (by t-test). As seen in Fig. 2, the ‘C’ nucleobase is located at the center of the 41-nt sequences. In the case of F. vesca, both the ‘C’ and ‘G’ bases were enriched (over-represented), while both the ‘A’ and ‘T’ bases were depleted (under-represented). Specifically, ‘C’ was over-represented at positions 1–3, 7–12, 14–20, 23, 24, 27, 30, and 33–41 and under-represented at position 6, while ‘G’ was enriched at positions 1, 4–7, 10, 11, 13–20, 22–26, 29, 32, 34, 35, 37, and 41 and depleted at position 28. The ‘A’ base was depleted at positions 1–3, 7–19, 22–24, 27, 30, and 34–39 and enriched at positions 25, 26, 28, and 29, while ‘T’ was depleted at positions 1, 3, 4, 7–11, 13–20, 22–27, 29, 32, 33, 35, 37, 38, 40, and 41, and was not enriched at any position. Some consecutive nucleotide pairs were also visible along the DNA sequences. For example, in the 4mC-containing sequences two consecutive ‘C’ and ‘G’ bases were spotted at positions 1–3, 7, 11, 14–20, 23, 24, 34, 35, and 37–41.

Fig. 2

Demonstration of nucleotide composition preferences between positives (4mC containing sequences) and negatives (non-4mC containing sequences) for F. vesca and R. chinensis datasets.

In the case of R. chinensis, ‘A’ was enriched at positions 4, 25, 26, 28, 29, and 33 and depleted at positions 20, 22, and 23. Successive ‘C’ bases were enriched at positions 7, 12–20, 23, 27, 30, 38, and 39 and depleted at positions 5, 6, 25, and 26. ‘G’ was enriched at positions 5, 6, 20, 22–24, 26, and 35 and depleted at positions 4, 12, 15, 16, 19, and 28–30, while ‘T’ was enriched at only a few positions, including 28 and 31, and depleted at positions 7, 14, 17, 18, 20, 22–24, 26, 27, 29, 32, 33, and 35. In short, in both species the over-represented and under-represented nucleotides differed significantly between the 4mC-containing and non-4mC-containing sequences. All results shown in Fig. 2 demonstrate that the distribution of the four nucleotides around 4mC sites differs in a statistically significant, position-specific manner between 4mC-containing and non-4mC-containing samples. Therefore, it is possible to design a computational model to identify 4mC sites based on sequence information alone.
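A simplified view of position-specific enrichment can be computed directly from two sets of aligned sequences. Note that this sketch only reports raw frequency differences; it does not perform the t-test that Two Sample Logo applies, and the function names and toy sequences are hypothetical.

```python
def position_frequencies(sequences, base):
    """Fraction of sequences carrying `base` at each position."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    return [sum(s[i] == base for s in sequences) / len(sequences)
            for i in range(length)]


def enrichment(positives, negatives, base):
    """Per-position frequency difference (positive minus negative)."""
    pos = position_frequencies(positives, base)
    neg = position_frequencies(negatives, base)
    return [p - n for p, n in zip(pos, neg)]


# Toy 5-nt sequences with 'C' at the center (real inputs would be 41 nt).
positives = ["GGCAA", "GCCAT", "GGCTA"]
negatives = ["ATCGG", "TACGC", "AACGG"]
diff_g = enrichment(positives, negatives, "G")
```

Positive values in `diff_g` mark positions where ‘G’ is more frequent in the positive set, negative values mark positions where it is less frequent, and the central position is zero because both sets carry ‘C’ there.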

Effect of using different encoding methods

Based on overlapping k-mers (1-mer to 6-mer), six feature-vector models were obtained by the word embedding process. All these vector representation models were fed into the CNN for independent identification of 4mC sites. We observed that the 3-mer was the most informative for predicting 4mC sites in both species. In this study, the word2vec representation based on 3-mers and classified by the CNN was considered the final model, ‘4mC-w2vec’. In the cross-validation test, the proposed predictor obtained an MCC of 0.7407, an accuracy of 0.8697, and an AUC of 0.9400 for F. vesca, and an MCC of 0.7093, an accuracy of 0.8541, and an AUC of 0.9370 for R. chinensis. According to prior research, biological sequences encoded with the one-hot method in conjunction with a deep learning model performed well in the 4mC prediction task [20]. We accordingly also used the one-hot encoding scheme to encode the DNA sequences, in which nucleotides A, C, G, and T were coded as (1 0 0 0), (0 1 0 0), (0 0 1 0), and (0 0 0 1), respectively. To determine the best CNN parameters for one-hot encoding, the grid search algorithm was used. The results showed that the word2vec method based on 3-mers outperformed the one-hot method on every metric for both species, whereas the 4-mer model gave mixed results relative to one-hot encoding. The performance of the six word embedding models based on different k-mers and of one-hot encoding, when classified by the CNN, is presented in Table 4. In terms of auROC, F. vesca reached 0.8920 using one-hot encoding but 0.9400 using word2vec (3-mer) encoding (Fig. 3 (a)). Similarly, the auROC for R. chinensis was 0.9110 using one-hot encoding and 0.9370 using word2vec (3-mer) encoding (Fig. 3 (b)).
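The one-hot baseline uses the fixed mapping given in the text (A, C, G, T to the four unit vectors). A minimal sketch, with the hypothetical helper name `one_hot_encode`:

```python
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element tuples (one per base)."""
    try:
        return [ONE_HOT[base] for base in sequence.upper()]
    except KeyError as err:
        raise ValueError(f"unsupported nucleotide: {err.args[0]!r}") from None


encoded = one_hot_encode("ACGT")
```

A 41-nt sequence becomes a 41 × 4 array under this scheme, compared with 39 × 100 for the 3-mer word2vec representation. Rejecting characters other than A, C, G, and T is a choice made for this sketch; how ambiguous bases were handled in the study is not stated.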

Table 4

Performance of the CNN using different word2vec models (based on k-mers) and one-hot encoding on the training dataset for both species by a 5-fold cross-validation test.

Species	Methods	Sn	Sp	ACC	MCC	AUC
F. vesca	k = 1	0.7963	0.7700	0.7832	0.5666	0.8520
	k = 2	0.7984	0.8295	0.8141	0.6283	0.8141
	k = 3	0.8976	0.8417	0.8697	0.7407	0.9400
	k = 4	0.8141	0.8600	0.8374	0.6751	0.9155
	k = 5	0.7931	0.7582	0.7754	0.5516	0.8505
	k = 6	0.7984	0.7302	0.7638	0.5296	0.8435
	onehot	0.8507	0.8244	0.8374	0.6752	0.8920
R. chinensis	k = 1	0.8711	0.6873	0.7793	0.5682	0.8781
	k = 2	0.8144	0.8199	0.8541	0.6335	0.8934
	k = 3	0.8219	0.8854	0.8541	0.7093	0.9370
	k = 4	0.8664	0.7633	0.8141	0.6326	0.8891
	k = 5	0.8220	0.7519	0.7870	0.5755	0.8755
	k = 6	0.7722	0.8066	0.7896	0.5793	0.8604
	onehot	0.7958	0.8371	0.8167	0.6337	0.9110

Note: The best performance value for each metric across different methods is highlighted in bold.

Fig. 3

Performance comparisons of word2vec-based model and one-hot encoding-based model when classified by CNN using a 5-fold cross-validation test on F. vesca (a) and R. chinensis (b).


Performance comparison with existing methods on the independent test datasets

To test whether 4mC-w2vec could identify 4mC sites on balanced and imbalanced blind datasets, we ran the model on the independent test datasets with different ratios of positive to negative samples (see Section 2.1). For imbalanced classification with only a few sequences in the minority (positive) class, auROC can be misleading, because even a small number of correct or incorrect predictions can cause a large change in the ROC curve or auROC score [71], [73]. For this reason, numerous surveys have suggested that a precision-recall curve (PR curve) is a superior alternative [74]. A PR curve plots precision (y-axis) against recall (x-axis) at different probability thresholds. Precision and recall focus on the minority (positive) class rather than the majority (negative) class [75]. A precision-recall AUC (PRauc) score of 1 represents a perfect model. To demonstrate the superiority of the 4mC-w2vec method, a comparison with existing methods was performed, including i4mC-Fuse [30] and DNC4mC-Deep [31]. These two web-servers were recently constructed, and both focus on identifying 4mC sites in the genomes of F. vesca and R. chinensis. For our comparison, we directly submitted the independent datasets, with the same positive/negative ratios, to these two web-servers. The performance of 4mC-w2vec, DNC4mC-Deep, and i4mC-Fuse on different evaluation indexes is presented in Table 5, with the corresponding precision-recall curves presented in Fig. 4. As shown in Table 5, among these three methods, 4mC-w2vec achieved the best performance across all evaluation indexes for all the different ratios in both species.
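To make the precision-recall idea concrete, the sketch below ranks predictions by score and computes precision and recall after each one, together with a step-wise area under the PR curve (average precision). The paper does not state how PRauc was computed, so this estimator is an assumption; the function names and the five example scores are hypothetical.

```python
def precision_recall_points(labels, scores):
    """Precision and recall after each prediction, ranked by score.

    Assumes binary labels (1 = positive) and no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    tp = 0
    points = []
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / rank, tp / n_pos))  # (precision, recall)
    return points


def average_precision(labels, scores):
    """Step-wise area under the PR curve (mean precision at each hit)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    tp, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            total += tp / rank
    return total / n_pos


# Hypothetical scores for five sequences (three positives, two negatives).
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
points = precision_recall_points(labels, scores)
ap = average_precision(labels, scores)
```

Because neither precision nor recall involves the true negatives, adding more correctly rejected negatives does not raise these values, which is the property that makes the PR curve informative for the 1:5 and 1:15 ratio groups.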

Table 5

The performance of i4mC-Fuse, DNC4mC-Deep, and 4mC-w2vec on the independent datasets with different positive-to-negative ratios.

Species	Method	Sn	Sp	ACC	MCC	PRauc
F. vesca	i4mC-Fuse
	ratio of [1:1]	0.8376	0.7209	0.7793	0.5624	0.8482
	ratio of [1:5]	0.8530	0.7105	0.7819	0.5695	0.8606
	ratio of [1:15]	0.8569	0.6434	0.7703	0.5586	0.8517
	DNC4mC-Deep
	ratio of [1:1]	0.8582	0.7390	0.7987	0.6016	0.8438
	ratio of [1:5]	0.8560	0.6950	0.7858	0.5810	0.8694
	ratio of [1:15]	0.8556	0.7183	0.7870	0.5795	0.8723
	4mC-w2vec
	ratio of [1:1]	0.8994	0.8268	0.8632	0.7283	0.9176
	ratio of [1:5]	0.8814	0.8449	0.8632	0.7269	0.9188
	ratio of [1:15]	0.8762	0.8062	0.8412	0.6842	0.9021
R. chinensis	i4mC-Fuse
	ratio of [1:1]	0.8505	0.7312	0.7909	0.5860	0.8646
	ratio of [1:5]	0.8411	0.6718	0.7716	0.5541	0.8526
	ratio of [1:15]	0.8072	0.6149	0.7612	0.5461	0.8507
	DNC4mC-Deep
	ratio of [1:1]	0.8637	0.7131	0.7935	0.5946	0.8700
	ratio of [1:5]	0.8537	0.7235	0.7987	0.6000	0.8641
	ratio of [1:15]	0.8391	0.6511	0.7703	0.5564	0.8594
	4mC-w2vec
	ratio of [1:1]	0.8737	0.8242	0.8490	0.6988	0.9099
	ratio of [1:5]	0.8840	0.7957	0.8400	0.6825	0.8966
	ratio of [1:15]	0.8940	0.8113	0.8477	0.6972	0.9136

Fig. 4

Comparison of the precision-recall curves (PRC) generated by our method and two existing methods on the different ratios of the balanced/imbalanced independent test datasets for both species. The PRauc scores and PR curves show that 4mC-w2vec outperforms the existing methods in the F. vesca (a–c) and R. chinensis (d–e) datasets.

It is clear that our model performed better than the existing ones for every ratio group. Specifically, the PR AUCs of 4mC-w2vec were 3–6% higher than those of the two existing methods, showing that our model is the most appropriate for 4mC site prediction on both imbalanced (ratio groups 1:5 and 1:15) and balanced (ratio group 1:1) datasets. Moreover, our results demonstrate that our model is stable as the class imbalance increases, while the performance of the other methods decreased as the proportion of negative samples in the datasets increased. The leading reasons for the superior performance of our model are as follows. Previous methods required the features to be encoded manually on the basis of domain knowledge. The proposed model, by contrast, does not require any domain knowledge. Instead, it learns the features automatically with the word2vec model from the complete genome rather than from a small set of sequences. Furthermore, the input sequence of the CNN model should be encoded in a way that preserves its information. Encoding each input sequence with the information learned from the whole genome by word2vec therefore gave a better representation of the input sequence than simpler techniques such as one-hot encoding, as shown in Table 4.

Web-server

A freely accessible web application was established at http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/. The general steps to use it are: (1) upload or copy/paste DNA sequences of exactly 41 nt in FASTA format (each record starts with the ‘>’ symbol); (2) select a threshold value between 0 and 1 [0.5 is recommended]; (3) select a species from the list box; (4) click the ‘Submit sequences’ button to obtain a prediction. The complete datasets used in this study and the trained word2vec models (12 models in total, six for each species using k = 1–6) for the genomes of F. vesca and R. chinensis are available in the dataset section of the web-server.
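Before submitting sequences, it can be convenient to check locally that each FASTA record meets the input requirements. The sketch below is not part of the web-server: the function names are hypothetical, the requirement of a central ‘C’ is inferred from the dataset construction, and treating a probability equal to the threshold as positive is an assumption.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_valid_input(sequence, length=41):
    """True for a 41-nt A/C/G/T sequence with 'C' at the central position."""
    return (len(sequence) == length
            and set(sequence) <= set("ACGT")
            and sequence[length // 2] == "C")


def call_site(probability, threshold=0.5):
    """Label a predicted probability using the chosen threshold."""
    return "4mC" if probability >= threshold else "non-4mC"


fasta = ">seq1\n" + "A" * 20 + "C" + "G" * 20 + "\n>seq2\nACGT\n"
records = parse_fasta(fasta)
valid = [name for name, seq in records if is_valid_input(seq)]
```

In this example only `seq1` passes the check; `seq2` is rejected because it is not 41 nt long.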

Conclusion

Accurately identifying 4mC sites is an important step towards understanding many biological functions. We developed a computational model that uses a word embedding method in conjunction with a deep neural network to identify such sites. The chief advantage of the proposed model over its predecessors is the automatic creation of high-dimensional word vectors from the whole genomes of F. vesca and R. chinensis, resulting in superior feature representation of 4mC sites. Put differently, the CNN can effectively capture the features generated by the word embedding process. Ultimately, our proposed method achieved better outcomes than the state-of-the-art predictors in identifying 4mC sites in both balanced and imbalanced class datasets. The study presented in this paper could be helpful for more widespread bioinformatics applications.

Funding

This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C2005612), in part by the Brain Research Program of the National Research Foundation (NRF) funded by the Korean government (MSIT) (No. NRF-2017M3C7A1044816), and in part by research funds for newly appointed professors of Jeonbuk National University, South Korea, in 2020.

CRediT authorship contribution statement

Jhabindra Khanal: Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Investigation, Writing - original draft, Writing - review & editing. Hilal Tayara: Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Investigation, Writing - original draft, Writing - review & editing. Quan Zou: Writing - original draft, Writing - review & editing. Kil To Chong: Writing - original draft, Writing - review & editing, Project administration, Supervision, Resources, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
