
i4mC-EL: Identifying DNA N4-Methylcytosine Sites in the Mouse Genome Using Ensemble Learning.

Yanjuan Li, Zhengnan Zhao, Zhixia Teng.

Abstract

As an important epigenetic modification, DNA N4-methylcytosine (4mC) plays a crucial role in controlling gene expression, DNA replication, the cell cycle, and differentiation. The accurate identification of 4mC sites is necessary to understand these biological functions. In this paper, we use ensemble learning to develop a model named i4mC-EL to identify 4mC sites in the mouse genome. Firstly, a multifeature encoding scheme consisting of Kmer and EIIP was adopted to describe the DNA sequences. Secondly, on the basis of this multifeature encoding scheme, we developed a stacked ensemble model in which four machine learning algorithms, namely, BayesNet, Naive Bayes, LibSVM, and Voted Perceptron, were utilized to implement an ensemble of base classifiers whose intermediate results serve as input to the metaclassifier, Logistic. The experimental results on the independent test dataset demonstrate that the overall predictive accuracy of i4mC-EL is 82.19%, which is better than that of existing methods. The user-friendly website implementing i4mC-EL can be accessed freely at the following.
Copyright © 2021 Yanjuan Li et al.

Year:  2021        PMID: 34159192      PMCID: PMC8187051          DOI: 10.1155/2021/5515342

Source DB:  PubMed          Journal:  Biomed Res Int            Impact factor:   3.411


1. Introduction

As a chemical modification occurring on DNA sequences, DNA methylation can change genetic properties while the order of the DNA sequence itself remains unchanged. DNA methylation has many manifestations, such as 5-methylcytosine (5mC for short), N6-methyladenine (6mA for short), and N4-methylcytosine (4mC for short) [1]. Among them, 5mCs are widely present in prokaryotes and eukaryotes and are of great significance for controlling gene differentiation and gene expression and for maintaining chromosome stability and cell structure [2, 3]. They can also cause diseases such as cancer [4-6]. The 6mAs are likewise widely distributed in prokaryotes and eukaryotes and play a crucial role in gene replication, expression, and transcription [7]. The 4mCs, first reported in 1983, mainly exist in prokaryotes, where they can control DNA replication, gene expression, and the cell cycle [8]. However, compared with 5mCs and 6mAs, current research on 4mCs is still insufficient. To make up for this deficiency and further understand the biological properties and functions of 4mCs, the first thing we need to do is to identify 4mCs in various DNA sequences, which remains a hot research topic.

In order to identify 4mCs, many biology-based approaches have been explored. Single-molecule real-time sequencing technology (SMRT for short) [9, 10] identifies 4mCs by detecting optical signals of bases matching the template at the single-molecule level. 4mC-Tet-assisted bisulfite sequencing technology (4mC-Tet for short) [11] identifies 4mCs by using bisulfite to convert unmethylated cytosine in DNA sequences into uracil while keeping methylated cytosine unchanged. However, these technologies are time-consuming and resource-intensive. Moreover, the explosive growth of DNA sequence data makes it even more difficult to achieve whole-genome analysis through these technologies. Therefore, using machine learning (ML for short) to identify 4mCs shows clear advantages.
Up to now, many models have used machine learning to identify 4mCs. iDNA4mC [12], the earliest model for 4mC identification, is primarily used to identify 4mCs in the genomes of six species, A. thaliana, C. elegans, D. melanogaster, E. coli, G. pickeringii, and G. subterraneus, and its positive data containing 4mCs were obtained from a reliable database called MethSMRT [13]. Soon afterwards, several other models, 4mcPred [14], 4mcPred-SVM [15], 4mcPred-IFL [16], and Meta-4mcPred [17], were proposed successively; they used the same dataset as iDNA4mC [12] for 4mC identification in the genomes of these six species. i4mC-Rose [18] is the first and so far only model for 4mC identification in the genomes of Rosaceae, and it derived its positive dataset from MDR [19], another reliable database for storing 4mC data. For the mouse genome studied here, there have been two models to date, 4mCpred-EL [20] and i4mC-Mouse [21], whose samples containing 4mCs were also obtained from the MethSMRT database. 4mCpred-EL selected 4 ML algorithms and 7 feature encoding schemes to generate 28 sets of results as the final coding; it then trained 4 submodels with this coding and the 4 ML algorithms and combined the submodels into the final model by majority voting. i4mC-Mouse trained 6 submodels using 6 feature encoding schemes and the random forest (RF for short) algorithm and then combined the 6 submodels into the final model by weighted voting. Compared with 4mCpred-EL, i4mC-Mouse has better performance according to the indicators ACC and MCC. Although exciting results have been achieved by 4mCpred-EL and i4mC-Mouse, the performance can be further improved. In this paper, to further improve the prediction capability, we propose a new 4mC predictor for the mouse genome, i4mC-EL.

2. Materials and Methods

2.1. Framework of i4mC-EL

In the present study, a novel model named i4mC-EL is proposed to identify 4mC sites in the mouse genome; its framework is shown in Figure 1. First, using two different feature encoding schemes, Kmer and EIIP, each DNA sequence was encoded into a 1364-dimensional vector and a 41-dimensional vector, respectively. Next, the two vectors of each DNA sequence were concatenated to form a 1405-dimensional multifeature vector. Finally, a two-stage stacked ensemble learning classifier taking these multifeature vectors as input was constructed. The ensemble classifier used BayesNet, Naive Bayes Multinomial, LibSVM, and Voted Perceptron as base classifiers and Logistic as the metaclassifier. The datasets, feature encoding schemes, and classifiers of i4mC-EL are described in detail below.
Figure 1

The framework of i4mC-EL.

2.2. Dataset

This paper adopted the benchmark dataset constructed by Hasan's team [21]. In this dataset, the positive samples containing mouse 4mC sites were obtained from the MethSMRT [17] database, and the negative samples were taken from chromosomal DNA sequences. All samples are DNA fragments of 41 nucleotides with a “C” in the middle. To obtain a high-quality dataset, only sequences whose modQV values were greater than or equal to 20 were retained. To prevent the predictor from overfitting, redundant sequences were removed with CD-HIT [22] at a threshold of 70% [23]. The resulting dataset contained 1,812 DNA sequences, 906 of which were 4mCs and 906 non-4mCs. About 80% of the dataset was randomly selected as the training dataset, and the remaining roughly 20% was used as the independent test dataset. The training dataset (train-1492) consisted of 746 4mCs and 746 non-4mCs, and the independent test dataset (test-320) included 160 4mCs and 160 non-4mCs.
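As an illustrative sketch (not the authors' code), the stratified 80/20 partition described above can be reproduced in Python; the sequence strings are placeholders for the real 41-nt samples from Hasan's benchmark dataset:

```python
import random

# Illustrative reconstruction of the dataset partition; sequence strings are
# placeholders, the real 41-nt samples come from Hasan's benchmark dataset.
positives = [("pos_seq_%d" % i, 1) for i in range(906)]  # 4mCs
negatives = [("neg_seq_%d" % i, 0) for i in range(906)]  # non-4mCs

random.seed(42)
random.shuffle(positives)
random.shuffle(negatives)

# Stratified split: 746 of each class for train-1492, 160 of each for test-320.
train = positives[:746] + negatives[:746]
test = positives[746:] + negatives[746:]
print(len(train), len(test))  # 1492 320
```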

2.3. Feature Encoding

Transforming DNA sequences into vectors that can effectively distinguish 4mCs from non-4mCs is the first step in building an ensemble learning-based predictor for 4mC identification [24-29]. Here, a multifeature encoding scheme composed of Kmer [30-33] and EIIP [34] was used to encode the DNA sequences. Kmer represents a DNA sequence by the occurrence frequencies of k adjacent nucleotides. EIIP encodes each nucleotide in a DNA sequence with its corresponding electron-ion interaction pseudopotential value. The experiments in Section 3 show that this multifeature can encode DNA sequences effectively. The following parts describe Kmer and EIIP in detail.

2.3.1. Kmer

This encoding scheme refers to the frequency of k-nucleotides composed of k continuous nucleotides in each sequence. For a sequence D = d1d2d3 ⋯ dL, each element of the feature vector is calculated by Equation (1):

f(X) = F(X)/(L − k + 1),  (1)

where X is one of the 4^k possible k-nucleotides, F(X) and f(X) are the count and frequency of X in D, respectively, and L is D's length. For a given k, Kmer thus transforms each sequence into a 4^k-dimensional vector. For example, when the k-mer parameter k = 2, the value of AA in the 16-dimensional (4^2) feature vector of sequence D1 = AAACTAGTC is 2/8 = 0.25. In the present study, we choose the values of the parameter k to be 1, 2, 3, 4, and 5, generating 1364-dimensional (4^1 + 4^2 + 4^3 + 4^4 + 4^5) feature vectors.
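As an illustration of Equation (1) (not the authors' implementation), the Kmer frequencies for k = 1 to 5 can be computed as follows; the example reproduces the AA frequency of 0.25 for D1 = AAACTAGTC:

```python
from itertools import product

# Sketch of the Kmer encoding: for each k, the frequency of every possible
# k-nucleotide X is f(X) = F(X) / (L - k + 1), per Equation (1).
def kmer_features(seq, ks=(1, 2, 3, 4, 5)):
    features = []
    for k in ks:
        counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
        windows = len(seq) - k + 1          # number of k-length windows, L - k + 1
        for i in range(windows):
            counts[seq[i:i + k]] += 1
        features.extend(counts[x] / windows for x in sorted(counts))
    return features

vec = kmer_features("AAACTAGTC", ks=(2,))
print(vec[0])                        # frequency of "AA": 2 occurrences / 8 windows = 0.25
print(len(kmer_features("A" * 41)))  # 4 + 16 + 64 + 256 + 1024 = 1364
```

For the 41-nt sequences in this study, concatenating the k = 1 to 5 blocks yields the 1364-dimensional Kmer vector described above.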

2.3.2. EIIP

EIIP is short for electron-ion interaction pseudopotential. The encoding scheme based on EIIP was proposed by Nair and Sreenadhan in 2006. In this scheme, each nucleotide in a sequence is replaced by its corresponding electron-ion interaction pseudopotential value (Table 1). For example, the result of encoding sequence D2 = AACTG with EIIP is (0.1260, 0.1260, 0.1340, 0.1335, 0.0806). In the present study, each 41-nucleotide sequence is thus transformed into a 41-dimensional feature vector.
Table 1

The electron-ion interaction pseudopotential values for DNA nucleotides.

NT      A       C       G       T
EIIP    0.1260  0.1340  0.0806  0.1335
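A minimal sketch (not the authors' code) of the EIIP encoding using the Table 1 values:

```python
# EIIP encoding sketch: each nucleotide is replaced by its electron-ion
# interaction pseudopotential value from Table 1.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

def eiip_encode(seq):
    return [EIIP[nt] for nt in seq]

# Reproduces the worked example D2 = AACTG from the text.
assert eiip_encode("AACTG") == [0.1260, 0.1260, 0.1340, 0.1335, 0.0806]
print(len(eiip_encode("A" * 41)))  # a 41-nt sequence gives a 41-dim vector
```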

2.4. Classifier

As an open data mining platform, Weka assembles a large number of machine learning algorithms for data mining tasks. In the present paper, the classifiers we used were all implemented in Weka, including BayesNet, NaiveBayes, SGD, SimpleLogistic, SMO, IBk, JRip, J48, and ensemble learning. We ultimately chose ensemble learning; the results of the related experiments are presented in Section 3. According to the combination strategy, bagging, boosting, and stacking are the three main types of ensemble learning. Ensemble learning is widely used in bioinformatics because it can improve the prediction performance of classifiers, for example in protein-protein interaction [35], disease prediction [36], type III secreted effector prediction [37], and protein subcellular location prediction [38]. In detail, we used two-stage stacked ensemble learning, in which the base classifiers were BayesNet [39], Voted Perceptron [40], Naive Bayes Multinomial [41], and LibSVM [42], and the metaclassifier was Logistic. At the first stage of the ensemble classifier, the four base classifiers are each trained on the multifeature vectors proposed in this paper to relabel the training dataset and the independent test dataset. At the second stage, the outputs of the base classifiers are used as input to the metaclassifier. Figure 2 gives the detailed process of model generation and result output; the steps are as follows.
Figure 2

Working diagram of ensemble learning.

Step 1.

Partition dataset. Divide the training dataset into ten parts and mark them as train 1, train 2,…, train 10. The independent test dataset remains unchanged.

Step 2.

Train base classifiers. In the present paper, we chose BayesNet, Voted Perceptron, Naive Bayes Multinomial, and LibSVM as base classifiers. For each base classifier, such as BayesNet, 10-fold crossvalidation is performed: train 1, train 2,…, train 10 are used as the validation dataset in turn, the other nine parts are used as the training dataset, and a prediction is also made on the independent test dataset. This yields 10 sets of predictions on the training dataset together with another 10 on the independent test dataset. The 10 predictions on the training dataset are concatenated vertically to get A1, and the 10 predictions on the independent test dataset are averaged to get B1. Similarly, we obtain A2 and B2 from Naive Bayes Multinomial, A3 and B3 from LibSVM, and A4 and B4 from Voted Perceptron.

Step 3.

Train the metaclassifier. Use the predictive values of the 4 base classifiers on the training dataset, A1, A2, A3, and A4, as 4 features to train the Logistic metaclassifier.

Step 4.

Predict new data. Use the trained metaclassifier to make predictions on the 4 features B1, B2, B3, and B4, which are constructed from the base classifiers' predictions on the independent test dataset; this yields the final prediction results.
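The out-of-fold data flow of Steps 1-4 can be sketched in pure Python as below. The toy base learners and the majority-vote metaclassifier are stand-ins for the paper's Weka classifiers (BayesNet, Naive Bayes Multinomial, LibSVM, Voted Perceptron, and Logistic); only the stacking mechanics are illustrated:

```python
import random

# Pure-Python sketch of the out-of-fold stacking flow in Steps 1-4. The toy
# base learners and the majority-vote metaclassifier are stand-ins for the
# Weka classifiers used in the paper; only the data flow is illustrated.

def toy_fit(threshold):
    """Build a toy base learner: predict 1 when the feature mean > threshold."""
    def fit(X, y):  # the toy learner ignores its training data
        return lambda Xnew: [1 if sum(x) / len(x) > threshold else 0 for x in Xnew]
    return fit

def stacked_predict(X, y, X_test, base_fits, n_folds=10):
    folds = [list(range(i, len(X), n_folds)) for i in range(n_folds)]
    A = [[0] * len(X) for _ in base_fits]         # Step 2: meta-features A_i
    B = [[0.0] * len(X_test) for _ in base_fits]  # Step 2: averaged test preds B_i
    for b, fit in enumerate(base_fits):
        for fold in folds:
            train_idx = [i for i in range(len(X)) if i not in fold]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            for i, p in zip(fold, model([X[i] for i in fold])):
                A[b][i] = p                        # out-of-fold prediction
            for j, p in enumerate(model(X_test)):
                B[b][j] += p / n_folds             # average of the 10 test preds
    # Steps 3-4: A would train the Logistic metaclassifier; here a simple
    # majority vote over the averaged base outputs B stands in for it.
    return [1 if sum(B[b][j] for b in range(len(B))) > len(B) / 2 else 0
            for j in range(len(X_test))]

random.seed(0)
X = [[random.random() for _ in range(5)] for _ in range(40)]
y = [1 if sum(x) / 5 > 0.5 else 0 for x in X]
preds = stacked_predict(X, y, [[0.9] * 5, [0.1] * 5],
                        [toy_fit(t) for t in (0.4, 0.5, 0.6)])
print(preds)  # [1, 0]
```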

2.5. Performance Evaluation

For the sake of validating the quality of our classification predictor, we used four indicators widely adopted in the field of bioinformatics for evaluation [43-53]. These indicators can be calculated using the formulas below:

Sn = TP/(TP + FN)
Sp = TN/(TN + FP)
ACC = (TP + TN)/(TP + TN + FP + FN)
MCC = (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP is the number of sequences that are actually 4mCs and are identified as 4mCs by the model, FP is the number that are actually non-4mCs but are identified as 4mCs, TN is the number that are actually non-4mCs and are identified as non-4mCs, and FN is the number that are actually 4mCs but are identified as non-4mCs. Sn refers to the prediction accuracy on 4mCs, Sp to the prediction accuracy on non-4mCs, and ACC to the overall prediction accuracy on both 4mCs and non-4mCs. MCC represents the reliability of the prediction results. The higher the values of these four indicators, the better the capability of the predictor.
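The four indicators can be computed directly from the confusion-matrix counts. In the sketch below (not the authors' code), the TP/FP/TN/FN values are reconstructed as an assumption from i4mC-EL's TEST-320 indicators (Sn = 0.806 and Sp = 0.838 on 160 positives and 160 negatives) and recover the reported ACC of 82.19% and MCC of 0.644:

```python
import math

# Sketch of the four indicators computed from confusion-matrix counts.
def evaluate(tp, fp, tn, fn):
    sn = tp / (tp + fn)                    # sensitivity: accuracy on 4mCs
    sp = tn / (tn + fp)                    # specificity: accuracy on non-4mCs
    acc = (tp + tn) / (tp + fp + tn + fn)  # overall accuracy
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc

# Counts reconstructed (assumption, not reported directly in the paper) from
# i4mC-EL's TEST-320 indicators: 160 positives, 160 negatives.
sn, sp, acc, mcc = evaluate(tp=129, fp=26, tn=134, fn=31)
print(round(acc, 4), round(mcc, 3))  # 0.8219 0.644
```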

3. Results and Discussion

3.1. Crossvalidation Results of TRAIN-1492

To find features that can adequately represent the structure and function of the DNA sequences, we contrasted numerous feature encoding schemes. To achieve the optimal accuracy, we also trained the model using several different classification algorithms. The results of the relevant comparative experiments are presented below.

3.1.1. Feature Encoding Comparison on Crossvalidation

As described in the “Feature Encoding” section, we encode the DNA sequences with a multifeature that combines the k-mer and EIIP feature encoding methods. To verify the validity of the proposed multifeature, we compared it with the BPF, DPE, RFHC, RevKmer, and PseKNC feature encoding schemes and their combinations, using the ensemble learning classifier. Among them, BPF and DPE are encoding schemes based on nucleotide positions; BPF takes mononucleotides as its encoding targets, while DPE takes dinucleotides. RFHC is an encoding scheme based on the physicochemical properties of nucleotides. RevKmer is a variant of Kmer that considers not only the current k-nucleotides themselves but also their reverse complements. PseKNC is a method that integrates continuous local and global k-tuple nucleotide information into the feature vectors of DNA sequences. Table 2 displays the experimental results, in which “our method” denotes the multifeature described in the “Feature Encoding” section. As shown in Table 2, in terms of ACC and MCC, the values of our method are higher than those of all other feature encoding schemes, which indicates that our method has better overall performance. In terms of Sp, the value of our method is also the highest, which indicates that it is better at identifying non-4mCs among negative samples. These results demonstrate the validity of our method.
Table 2

The contrast of performance for dissimilar feature encoding schemes under 10-fold crossvalidation.

Schemes             ACC     MCC     Sn      Sp
BPF                 0.668   0.335   0.665   0.670
DPE                 0.614   0.228   0.619   0.609
RFHC                0.658   0.316   0.669   0.647
RevKmer             0.755   0.511   0.745   0.765
PseKNC              0.794   0.589   0.786   0.803
k-mer + BPF         0.724   0.448   0.729   0.718
k-mer + RFHC        0.747   0.493   0.744   0.749
RevKmer + DBE       0.738   0.476   0.723   0.753
RevKmer + EIIP      0.779   0.558   0.764   0.794
k-mer + BPF + DPE   0.732   0.464   0.741   0.723
Our method          0.803   0.606   0.784   0.822
To further illustrate the prediction capability of our selected multifeature encoding scheme, the ROC curves for dissimilar feature encoding schemes under 10-fold crossvalidation are displayed in Figure 3. From Figure 3, we can see that our method has the largest area under ROC curve (AUC), which demonstrates that our method can represent mouse's DNA sequences better than others.
Figure 3

ROC curves for dissimilar feature encoding schemes under 10-fold crossvalidation.
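The AUC values behind these ROC comparisons can be computed from predicted scores using the rank interpretation of AUC: the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal sketch (not the authors' code, and with hypothetical toy scores):

```python
# AUC via its rank interpretation: the fraction of (positive, negative) pairs
# where the positive gets the higher score (ties count half).
def auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores (hypothetical, not from the paper): 3 of 4 pos/neg pairs ranked correctly.
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```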

3.1.2. Classifier Comparison on Crossvalidation

As described in the “Classifier” section, we fed the multifeature composed of k-mer and EIIP into an ensemble learning classifier called stacking, obtaining a predictor for identifying mouse 4mC sites. To verify the validity of the stacking used in this paper, on the basis of the same multifeature, we compared stacking with eleven commonly used classifiers: BayesNet, NaiveBayes, SGD, Simple Logistic, SMO, IBk, JRip, J48, Random Forest, AdaBoostM1, and Bagging. Among them, BayesNet characterizes the dependencies among attributes with the aid of directed acyclic graphs and uses conditional probability tables to describe the joint probability distribution of attributes. NaiveBayes is a simple probabilistic classifier based on Bayes' theorem under the assumption that the attributes are mutually independent. SGD implements a regularized linear support vector machine classifier trained with stochastic gradient descent. Simple Logistic is a linear logistic regression classifier. SMO is a support vector machine classifier trained with the sequential minimal optimization algorithm. IBk classifies a data point by the categories of the k data points closest to it. JRip is a classifier based on rule induction. J48 is a decision tree classifier that uses the information gain ratio to select attributes for partitioning. Random Forest is a classifier that uses multiple trees to train on and predict a sample. AdaBoostM1 is a classifier that makes previously misclassified training samples receive more attention in subsequent rounds by adjusting their distribution. Bagging uses bootstrap sampling to draw m sample datasets from the original dataset (m is the predetermined number of base classifiers), trains m base classifiers on them, and integrates them by voting. The results of these comparative experiments are displayed in Table 3, where “our method” refers to the stacking classifier.
From Table 3, we can see that our method outperforms the other classifiers in all indicators.
Table 3

The contrast of performance for dissimilar classifiers under 10-fold crossvalidation.

Classifiers         ACC     MCC     Sn      Sp
BayesNet            0.727   0.453   0.739   0.714
NaiveBayes          0.752   0.504   0.751   0.753
SGD                 0.712   0.424   0.710   0.713
Simple Logistic     0.761   0.522   0.753   0.768
SMO                 0.702   0.405   0.706   0.698
IBk                 0.637   0.276   0.584   0.690
JRip                0.707   0.414   0.692   0.723
J48                 0.665   0.330   0.674   0.655
Random Forest       0.770   0.541   0.753   0.787
AdaBoostM1          0.713   0.427   0.739   0.688
Bagging             0.729   0.459   0.744   0.714
Our method          0.803   0.606   0.784   0.822
To further illustrate the classification capability of our selected stacking classifier, the ROC curves for dissimilar classifiers under 10-fold crossvalidation are displayed in Figure 4. From Figure 4, we can see that the area under ROC curve (AUC) of our method is the largest, which proves that our proposed method has better prediction performance for identifying 4mCs in the mouse genome than other methods.
Figure 4

ROC curves for dissimilar classifiers under 10-fold crossvalidation.

3.2. Independent Validation Results of TEST-320

In this section, a comparative experiment on the independent test dataset (TEST-320) is conducted to show the generalization capability of our selected multifeature and stacking classifier. The rationale is that the model is trained and tested on two different datasets, which is equivalent to performing a real prediction task with the generated model.

3.2.1. Feature Encoding Comparison on Independent Validation

Using the stacking classifier, we evaluated the generalization capability on TEST-320 of the various feature encoding schemes described in Section 3.1.1. Table 4 displays the results of these comparison experiments. As shown in Table 4, among the compared feature encoding schemes, our method performed best in ACC, Sn, and MCC, reaching 82.19%, 0.806, and 0.644, respectively. Although the Sp of our method is lower than that of BPF, k-mer + BPF, k-mer + RFHC, k-mer + BPF + DPE, and PseKNC+EIIP+RFHC, the other three indicators of our method are higher than theirs.
Table 4

The contrast of performance for dissimilar feature encoding schemes on TEST-320.

Schemes             ACC     MCC     Sn      Sp
BPF                 0.753   0.530   0.606   0.900
DPE                 0.697   0.401   0.600   0.794
RFHC                0.716   0.438   0.631   0.800
RevKmer             0.666   0.335   0.744   0.588
PseKNC              0.781   0.563   0.788   0.775
k-mer + BPF         0.772   0.553   0.681   0.863
k-mer + RFHC        0.800   0.614   0.694   0.906
RevKmer + DBE       0.756   0.516   0.700   0.813
RevKmer + EIIP      0.713   0.427   0.763   0.663
k-mer + BPF + DPE   0.772   0.553   0.681   0.863
Our method          0.822   0.644   0.806   0.838
For the sake of further describing the generalization capability of our selected multifeature encoding scheme, Figure 5 displays the ROC curves for dissimilar feature encoding schemes on TEST-320. From Figure 5, we can see that the AUC of our method is the largest, and the ROC curve of our method is closer to the upper left, which demonstrates that our selected multifeature is more suitable than other schemes to encode the DNA sequences used to recognize mouse's 4mC.
Figure 5

ROC curves for dissimilar feature encoding schemes on TEST-320.

3.2.2. Classifier Comparison on Independent Validation

We compared the stacking classifier used in this paper with the eleven other classifiers on TEST-320, using the multifeature combining k-mer and EIIP as input. The results of these comparative experiments are displayed in Table 5, from which we can see that although the Sp of BayesNet is slightly higher than that of our method, our method outperforms all other classifiers in ACC, Sn, and MCC. Overall, our selected stacking classifier performs better than the others, indicating that it is effective for identifying mouse 4mC sites.
Table 5

The contrast of performance for dissimilar classifiers on TEST-320.

Classifiers         ACC     MCC     Sn      Sp
BayesNet            0.769   0.547   0.675   0.863
NaiveBayes          0.788   0.577   0.744   0.831
SGD                 0.688   0.379   0.756   0.619
Simple Logistic     0.728   0.456   0.738   0.719
SMO                 0.675   0.353   0.744   0.606
IBk                 0.600   0.201   0.563   0.638
JRip                0.769   0.541   0.713   0.825
J48                 0.663   0.325   0.656   0.669
Random Forest       0.778   0.558   0.738   0.819
AdaBoostM1          0.791   0.581   0.794   0.788
Bagging             0.781   0.564   0.744   0.819
Our method          0.822   0.644   0.806   0.838
To further describe the generalization capability of our selected stacking classifier, the ROC curves for the different classifiers on TEST-320 are displayed in Figure 6, from which we can conclude that the AUC of our method is again the largest, which indicates that our proposed stacking-based ensemble classifier is more suitable for identifying mouse 4mC sites than the other classifiers.
Figure 6

ROC curves for dissimilar classifiers on TEST-320.

3.3. Contrast with Extant Models on TEST-320

Here, we contrasted i4mC-EL with 4mCpred-EL and i4mC-Mouse on TEST-320 to further evaluate its performance. Table 6 displays the results of these contrast experiments, in which the data for 4mCpred-EL and i4mC-Mouse are taken from the original publications [20, 21]. From Table 6, we can see that i4mC-EL is superior to 4mCpred-EL and i4mC-Mouse in three indexes: ACC, Sp, and MCC. Although the Sn of i4mC-Mouse is slightly higher than that of our method, our method outperforms i4mC-Mouse in the other three indexes. All in all, i4mC-EL performs better than the extant methods.
Table 6

The contrast of performance for dissimilar models on TEST-320.

Models          ACC     MCC     Sn      Sp
4mCpred-EL      0.791   0.584   0.757   0.825
i4mC-Mouse      0.816   0.633   0.807   0.825
i4mC-EL         0.822   0.644   0.806   0.838

4. Conclusions

In the present paper, an ensemble learning model called i4mC-EL was designed to identify 4mC sites in the mouse genome. In the process of constructing i4mC-EL, to determine the optimal combination of feature encoding schemes and classifiers, we conducted extensive comparative experiments on different features and classifiers. Finally, we encoded DNA sequences with a multifeature combining k-mer and EIIP and then used two-stage stacked ensemble learning as the classifier, with BayesNet, Naive Bayes Multinomial, LibSVM, and Voted Perceptron as base classifiers and Logistic as the metaclassifier. In addition, we contrasted i4mC-EL with existing models to demonstrate its effectiveness. The results show that i4mC-EL is better than the existing models and has better generalization capability. In summary, i4mC-EL is effective in predicting 4mC sites in the mouse genome, which helps us to understand the biochemical properties of 4mC. In future work, we will use adaptive feature vectors to represent DNA sequences and thereby optimize the feature encoding scheme [54, 55]. Furthermore, other improvements, including new encoding schemes, classifier algorithms, and intelligent computing models for identifying 4mC sites, will also be considered.
References (36 in total; first 10 shown below)

1.  N6-methyldeoxyadenosine marks active transcription start sites in Chlamydomonas.

Authors:  Ye Fu; Guan-Zheng Luo; Kai Chen; Xin Deng; Miao Yu; Dali Han; Ziyang Hao; Jianzhao Liu; Xingyu Lu; Louis C Dore; Xiaocheng Weng; Quanjiang Ji; Laurens Mets; Chuan He
Journal:  Cell       Date:  2015-04-30       Impact factor: 41.582

2.  MotifCNN-fold: protein fold recognition based on fold-specific features extracted by motif-based convolutional neural networks.

Authors:  Chen-Chen Li; Bin Liu
Journal:  Brief Bioinform       Date:  2020-12-01       Impact factor: 11.622

3.  A comparison and assessment of computational method for identifying recombination hotspots in Saccharomyces cerevisiae.

Authors:  Hui Yang; Wuritu Yang; Fu-Ying Dao; Hao Lv; Hui Ding; Wei Chen; Hao Lin
Journal:  Brief Bioinform       Date:  2019-10-21       Impact factor: 11.622

4.  SelfAT-Fold: Protein Fold Recognition Based on Residue-Based and Motif-Based Self-Attention Networks.

Authors:  Yihe Pang; Bin Liu
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2022-06-03       Impact factor: 3.710

5.  BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches.

Authors:  Bin Liu; Xin Gao; Hanyu Zhang
Journal:  Nucleic Acids Res       Date:  2019-11-18       Impact factor: 16.971

6.  Gene2vec: gene subsequence embedding for prediction of mammalian N 6-methyladenosine sites from mRNA.

Authors:  Quan Zou; Pengwei Xing; Leyi Wei; Bin Liu
Journal:  RNA       Date:  2018-11-13       Impact factor: 4.942

7.  Meta-4mCpred: A Sequence-Based Meta-Predictor for Accurate DNA 4mC Site Prediction Using Effective Feature Representation.

Authors:  Balachandran Manavalan; Shaherin Basith; Tae Hwan Shin; Leyi Wei; Gwang Lee
Journal:  Mol Ther Nucleic Acids       Date:  2019-04-30

8.  CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors:  Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal:  Bioinformatics       Date:  2012-10-11       Impact factor: 6.937

9.  Computational identification and characterization of miRNAs and their target genes from five cyprinidae fishes.

Authors:  Yong Huang; Hong-Tao Ren; Quan Zou; Yu-Qin Wang; Ji-Liang Zhang; Xue-Li Yu
Journal:  Saudi J Biol Sci       Date:  2015-05-13       Impact factor: 4.219

10.  4mCpred-EL: An Ensemble Learning Framework for Identification of DNA N4-methylcytosine Sites in the Mouse Genome.

Authors:  Balachandran Manavalan; Shaherin Basith; Tae Hwan Shin; Da Yeon Lee; Leyi Wei; Gwang Lee
Journal:  Cells       Date:  2019-10-28       Impact factor: 6.600

  1 in total

1.  Systematic Analysis and Accurate Identification of DNA N4-Methylcytosine Sites by Deep Learning.

Authors:  Lezheng Yu; Yonglin Zhang; Li Xue; Fengjuan Liu; Qi Chen; Jiesi Luo; Runyu Jing
Journal:  Front Microbiol       Date:  2022-03-15       Impact factor: 5.640

