Y Y Zhuang1, H J Liu2, X Song3, Y Ju1, H Peng1. 1. School of Informatics, Xiamen University, Xiamen 361005, China. 2. College of Information Technology and Computer Science, University of the Cordilleras, Baguio 2600, Philippines. 3. School of Computer and Information Technology, Nanyang Normal University, Nanyang 473000, China. Electronic address: sxyoland@foxmail.com.
Abstract
N6-methyladenosine (m6A) is one of the most common and abundant modifications in RNA and is involved in many biological processes in humans. Abnormal RNA modifications are often associated with a range of diseases, including tumors, neurogenic diseases, and embryonic retardation. Therefore, identifying m6A sites is of paramount importance in the post-genomic age. Although many lab-based methods have been proposed to annotate m6A sites, they are time-consuming and not cost-effective. Given these drawbacks of experimental RNA sequence recognition, computational methods have been suggested as a supplement for identifying m6A sites. In this study, we develop a novel feature extraction algorithm based on the frequent gapped k-mer pattern (FGKP) and apply logistic regression to construct the prediction model. The new predictor is used to identify m6A sites in the Saccharomyces cerevisiae database. Ten-fold cross-validation shows that its performance is better than that of recent methods. These comparative results indicate that our model has great potential to become a useful and effective tool for genome analysis and to provide more insight into locating m6A sites.
Over 100 modifications occur in RNA. Internal modifications of mRNA help maintain its stability, and the most common ones include N6-methyladenosine (m6A), N1-methyladenosine (m1A), and 5-methylcytosine (m5C). Among them, researchers worldwide have verified many enzymes that m6A engages, such as histone demethylases, methylases, and methylation recognition enzymes. Abnormal m6A modifications are often related to a series of diseases, including tumors, neurogenic diseases, and embryonic retardation. RNA m6A was first observed in the 1970s. Since then, m6A has been found across a wide spectrum of living organisms and linked to many important biological activities, including mRNA splicing, stability, nuclear processing, and the immune response.5, 6, 7, 8 Therefore, transcriptome-wide annotation of m6A sites will help in understanding its biological functions.

In the past few years, high-throughput sequencing techniques such as MeRIP-seq and m6A-seq have identified m6A peaks in Saccharomyces cerevisiae, Mus musculus, and Homo sapiens. At the same time, the miCLIP technique was proposed to recognize m6A sites in the human transcriptome. However, given the inherent biological reliance of these techniques, they are still neither budget- nor time-efficient for transcriptome-wide analysis.

Although lab-based technologies have been widely applied to identify m6A, cost-effective computational methods have been developed to assist the process as well. To identify methylated m6A sites, building a high-resolution database is of paramount importance.
Using the high-resolution database of Saccharomyces cerevisiae constructed by Schwartz et al., Chen et al.14, 15, 16, 17, 18 proposed a series of predictors such as "iRNA-Methyl," "M6ATH," "MethyRNA," "iRNA-3typeA," and "iRNA(m6A)-PseDNC," which formulated RNA sequences using different combinations of feature extractions and classifiers to make predictions. Feng et al. used a method called "iRNA-PseColl," which incorporated collective features of the RNA sequence elements into PseKNC. Jaffrey et al. built a single-nucleotide-resolution map of m6A sites across Homo sapiens. More recently, Chen et al. proposed a support-vector-machine-based method to predict m6A sites in Arabidopsis thaliana. As noted in several references, well-established ensemble classifiers have been proven to outperform single classifiers.21, 22, 23 Based on this, Wei et al. proposed an m6A predictor built on an ensemble of support vector machines (SVMs), successfully improving predictive performance. Wei et al. have also done extensive research with ensemble classifiers, which served as a valuable reference for our study.

In this article, we propose a novel method for identifying m6A sites within RNA sequences. For feature representation, we use the frequent gapped k-mer pattern (FGKP) discovery algorithm to capture the properties of RNA sequences. For the predictive model, we use logistic regression to discriminate the positive and negative samples. Experimental results show that our model outperformed other existing methods in the literature under the 10-fold cross-validation test.
Results
Several diseases, including cancers, have their underlying causes in RNA.29, 30, 31 In our study, we combined the effective feature extraction of the frequent gapped k-mer (FGK) with the strong classification ability of the linear predictive model to create a powerful tool for discriminating the positive and negative samples of m6A. The learning machine we used was logistic regression (LR). We evaluated our predictor on the Saccharomyces cerevisiae genome using 10-fold cross-validation. It turns out that our model is superior to M6A-HPCS, a recent classifier in this area, and also outperforms alternative feature extractions and parameter settings within our model. We anticipate that it will shed some light on genome analysis in future practice.
Four Evaluation Metrics
In general, the following four metrics are used to measure the quality of a predictor: sensitivity (SN), specificity (SP), accuracy (ACC), and Matthews correlation coefficient (MCC). These metrics were first introduced by Chou and have since been applied to a wide range of biological areas (see Liu et al.,34, 35, 36, 37 Ehsan et al., Feng et al., Song et al., Lin et al., and Xu et al.). Their definitions are as follows:

SN = TP / (TP + FN),
SP = TN / (TN + FP),
ACC = (TP + TN) / (TP + TN + FP + FN),

and

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. In this research, TP counts true m6A sites predicted correctly, TN counts non-m6A sites predicted correctly as non-m6A sites, FP counts non-m6A sites predicted incorrectly as m6A sites, and FN counts true m6A sites predicted incorrectly as non-m6A sites. The values of SN, SP, and ACC lie between 0 and 1; the closer they are to 1, the more accurate the model. The value of MCC lies between −1 and 1; the larger the MCC, the better the performance of the prediction model.
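The four definitions above translate directly into code. As a minimal illustration (the counts used in the example call are made up, not taken from this study):

```python
import math

def metrics(tp, tn, fp, fn):
    """Compute SN, SP, ACC, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity: recall on true m6A sites
    sp = tn / (tn + fp)                    # specificity: recall on non-m6A sites
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc

# Example: 50 TP, 40 TN, 10 FP, 0 FN
sn, sp, acc, mcc = metrics(50, 40, 10, 0)  # sn=1.0, sp=0.8, acc=0.9, mcc≈0.8165
```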
Cross-Validation
Normally, three types of validation are used to derive the metric values: independent test sets, subsampling (K-fold cross-validation), and the jackknife test (leave-one-out cross-validation, LOOCV). Although the jackknife test makes full use of the available data to obtain a more accurate classifier and yields a definite sampling and error estimate for a given dataset, it is not time-efficient compared with the other two types of validation. In this article, we adopted the 10-fold cross-validation method used by many researchers42, 43, 44 in this area.
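The 10-fold protocol can be sketched in a few lines: shuffle the sample indices once, partition them into 10 folds, and let each fold serve once as the held-out test set. This is a generic sketch, not the authors' exact splitting code; the function names and seed are assumptions.

```python
import random

def k_fold_indices(n, k=10, seed=42):
    """Shuffle n sample indices and partition them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, k=10):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n, k)
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

For each (train, test) pair the model is fitted on the training indices and scored on the test indices; the reported metrics are averaged over the 10 folds.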
ROC Curve
The ROC curve (also called the sensitivity curve) is short for receiver operating characteristic curve. Each point on the curve corresponds to a different decision threshold applied to the same classifier scores, so the curve summarizes the overall performance of a binary classifier. The ROC curve is normally plotted with the false-positive rate (FPR) on the x axis and the true-positive rate (TPR) on the y axis across the different classification thresholds. The TPR equals the sensitivity described earlier, and the FPR can be computed as 1 − specificity. The area under the ROC curve (AUROC) indicates the performance of a predictor and typically ranges from 0.5 to 1. The closer a predictor's AUROC is to 1, the better and more robust the predictor; an AUROC of 0.5 corresponds to a random predictor.
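The AUROC has an equivalent probabilistic reading: it is the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one (ties counted as 1/2). A minimal sketch of that computation (the example labels and scores are illustrative, not data from this study):

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a
    random negative; ties contribute 1/2."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One misranked pair out of four → AUROC = 0.75
auroc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.2])  # → 0.75
```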
Discussion
Comparison among Different Feature Extractions
To justify our feature extraction technique, we compare it with two of the most commonly used feature representations, Triplet and Pse-SSC; the comparison shows that the FGK method achieves much better performance than the other two. The results are shown in Table 1, and Figure 1 gives a graphical comparison across the four evaluation metrics. FGK leads Pse-SSC by about 4.6% and Triplet by about 17% in ACC, and in MCC it exceeds both counterparts by more than 0.10. Figure 2 shows the ROC curves of the three feature extractions; the larger the area under the curve, the better the method performs. Table 1 also shows that the AUROC of our feature representation is 10.2% and 16.4% higher than those of Pse-SSC and Triplet, respectively.
Table 1. Comparison of Different Feature Extractions

Feature                  SP (%)   SN (%)   ACC (%)   MCC    AUROC
Triplet                  56.92    63.85    59.92     0.20   0.6669
Pse-SSC                  78.77    64.66    72.52     0.44   0.7284
Frequent gapped k-mer    71.92    83.62    77.10     0.55   0.8307
Figure 1
Performance of Different Feature Extractions Using 10-Fold Cross-validation
Here, we compare the effect of our feature extraction (FGK) with Pse-SSC and Triplet methods.
Figure 2
ROC Curves of Frequent Gapped K-mer, Pse-SSC, and Triplet and Their AUROC Values
Comparison with Other Classifiers
In Table 2, we compare LR with SVM and random forest (RF). We chose SVM and RF for comparison because they are two of the most widely used classifiers in bioinformatics.47, 48, 49, 50 Although the SP of the proposed method is lower than those of SVM and RF, its SN, ACC, and MCC are higher, indicating that the LR-based model can effectively discriminate the m6A sites in Saccharomyces cerevisiae. Figure 3 shows the overall performance of the three classifiers: although the SP of LR is lower than that of the other two classifiers, its other three metrics are clearly better. The ACC of LR is far better than that of SVM, exceeding it by almost 30%, and slightly exceeds that of RF by about 3.5%.
Table 2. Performance Comparison of Different Classifiers

Classifier   SP (%)   SN (%)   ACC (%)   MCC
SVM          80.00    46.83    48.09     0.10
RF           75.51    72.56    73.66     0.47
LR           71.92    83.62    77.10     0.55
Figure 3
Comparison of Performances among the LR Classifier and Other Popular Classifiers (SVM and RF) with the Same Learning Feature Representations on the S. cerevisiae Dataset
Comparison with Different Parameters
In Table 3, we compare the prediction performance of the model using different parameters and find that the parameters k = 4 and γ = 0.025 give the most desirable result: this classifier exceeds its counterpart with k = 5 and γ = 0.05 by about 3.4% in ACC.
Table 3. Performance Comparison of Different Parameters in Our Model

Classifier              SP (%)   SN (%)   ACC (%)   MCC
LR (k = 5, γ = 0.05)    73.53    73.81    73.66     0.47
LR (k = 4, γ = 0.025)   71.92    83.62    77.10     0.55
Comparison with Existing Predictors
To evaluate the performance of our proposed predictor, we compared it with two existing predictors, iRNA-Methyl and M6A-HPCS. We chose these two predictors because they have been reported to achieve outstanding performance in m6A site identification. For fairness of comparison, all compared predictors are trained and validated on the same benchmark dataset. The results are summarized in Table 4. Among the compared predictors, the proposed model obtains the best performance in terms of ACC and MCC, at 77.10% and 0.55, respectively. Compared with the best existing predictor, M6A-HPCS, our classifier is about 10% higher in ACC and 0.20 higher in MCC.
Table 4. Comparison of M6APred-FG with Other Well-Known Classifiers

Prediction Method   SP (%)   SN (%)   ACC (%)   MCC
iRNA-Methyl         60.63    70.59    65.59     0.29
M6A-HPCS            62.89    71.77    67.33     0.35
iRNA-Freq           71.92    83.62    77.10     0.55
Materials and Methods
Framework of the Proposed Predictor
Figure 4 shows the flowchart of the proposed predictor. The first stage is to collect data from verified databases and the relevant literature. In this research, we use the organized dataset from Chen et al.'s work. The second stage is feature encoding, which includes feature representation and feature optimization. Feature representation means extracting characteristics of RNA sequences using various feature descriptors, including composition features such as dinucleotide-based auto covariance (DAC), physicochemical features such as PC-PseDNC-General, and our newly proposed FGKP. The final stage is to train the machine learning model (i.e., SVM, RF, or logistic regression) on the features extracted in the previous stage. The constructed predictive model is then validated; in this study, we used the 10-fold cross-validation test.
Figure 4
Flowchart of the Proposed Predictor
Stage 1 shows the procedure of dataset preparation. We chose a benchmark database and used updated literature to obtain candidate sequences. Since the candidates have imbalanced positive and negative samples, we needed to balance the samples (and reduce redundancy) to obtain the primary dataset. Then, we divided the dataset into a training dataset and a test dataset. Stage 2 shows the feature encoding, or feature extraction: the sample sequences contain hidden information, and we needed a way to extract features that best represent the original samples and digitize them. Stage 3 shows how we used the training dataset and chose an appropriate model to obtain a prediction model and evaluate it. Stage 4 shows how we tested and validated the predictive model. In our article, we combined stages 3 and 4 using 10-fold cross-validation to evaluate the model.
Datasets
m6A sites have been widely identified in Saccharomyces cerevisiae, Homo sapiens, Mus musculus, and Arabidopsis thaliana. In this work, we used a dataset from Saccharomyces cerevisiae, in whose genome m6A sites share the same motif, GAC, and are more easily methylated. Since RNA sequences in Saccharomyces cerevisiae have different lengths, we used the organized dataset from Chen et al.'s work. It contains 1,307 positive samples and 1,307 negative samples, where the negative samples were randomly collected from 33,280 sequences without m6A sites. All sequences in the dataset are 51 nt long (25 nt on each side of the m6A/non-m6A site), with sequence similarity below 85%.
Representation of RNA Sample
The RNA samples in our dataset can be generally expressed as the pattern

R = N1 N2 N3 … N51,  (Equation 5)

where Ni ∈ {A, C, G, U} (i = 1, 2, …, 51). The first step is to transform the RNA sequence in Equation 5 into a vector. However, a naive vectorization might lose the sequential information and patterns. To solve this problem, we introduce the FGKP discovery algorithm that we recently developed. The algorithm proceeds in the following steps:

(1) Search all the frequent gapped k-mer (FGK) sub-sequences from each sequence in the dataset. We enumerate all gapped k-mer sub-sequences of each sequence and calculate their frequencies. Here, the parameter k is the matching length of the sub-sequences, and we denote the frequency threshold as γ.

(2) Build a set of the frequent sub-sequences. Every gapped k-mer sub-sequence whose frequency exceeds the threshold γ is added to the frequent sub-sequence set, and each such sub-sequence is mapped to a column of the table shown in Figure 5.
Figure 5
The Transformation from the Original Samples to 0–1 Sequences
(3) Use the frequent k-mer sub-sequence set as features to generate vectors. First, we define the following functions:

f(S, pj) = 1 if pj matches S, and 0 otherwise,  (Equation 6)

V(S) = (f(S, p1), f(S, p2), …, f(S, pm)).  (Equation 7)

Here, S denotes the sequence to be predicted, and pj denotes the j-th element of the frequent k-mer sub-sequence set. As Equation 6 shows, the function f compares the sequence S with the j-th element of the frequent k-mer set and returns 1 for a perfect match between S and pj and 0 otherwise. After this procedure, Equation 7 maps the sequence S to a 0–1 vector.
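The steps above can be sketched as follows. The article does not give pseudocode, so this is one plausible reading under stated assumptions: a gapped k-mer is taken to be a window of fixed span with k positions specified and the rest replaced by wildcards, the frequency of a pattern is the fraction of sequences containing it, and the parameter names (`k`, `gamma`, `span`) mirror the paper's k and γ; the `span` value is an assumption.

```python
from itertools import combinations

def mine_fgkp(seqs, k, gamma, span):
    """Mine frequent gapped k-mer patterns: fix k positions inside a window
    of length `span` (other positions become '.' wildcards) and keep every
    pattern occurring in at least a fraction `gamma` of the sequences."""
    counts = {}
    for seq in seqs:
        seen = set()
        for start in range(len(seq) - span + 1):
            window = seq[start:start + span]
            for fixed in combinations(range(span), k):
                pat = "".join(window[i] if i in fixed else "."
                              for i in range(span))
                seen.add(pat)
        for pat in seen:                    # count each pattern once per sequence
            counts[pat] = counts.get(pat, 0) + 1
    threshold = gamma * len(seqs)
    return sorted(p for p, c in counts.items() if c >= threshold)

def matches(seq, pattern):
    """Equation 6: f(S, p_j) = 1 if pattern p_j occurs anywhere in S, else 0."""
    span = len(pattern)
    for start in range(len(seq) - span + 1):
        if all(p == "." or p == c
               for p, c in zip(pattern, seq[start:start + span])):
            return 1
    return 0

def to_vector(seq, patterns):
    """Equation 7: map a sequence to its 0-1 feature vector."""
    return [matches(seq, p) for p in patterns]
```

On a toy dataset, `mine_fgkp(["GACU", "AGAC"], k=2, gamma=1.0, span=3)` returns the patterns shared by both sequences (`.AC`, `G.C`, `GA.`), and `to_vector` turns any new sequence into the corresponding 0–1 feature vector.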
Linear Predictive Model
Although a huge amount of literature addresses classification methods such as SVM54, 55, 56, 57, 58, 59, 60, 61, 62 and RF,47, 48, 49, 50 the feature representation algorithm for RNA samples produces a large amount of sparse data, so handling sparse data efficiently is imperative. The linear predictive model is a linear classifier for processing a large amount of sparse data with many examples and features. It is a general term for supervised models, including LR, SVM, and support vector regression (SVR). In this study, we used the packages LIBSVM and LIBLINEAR, which support the multiple types of linear classifiers mentioned earlier. We used LR and achieved a good result. LR uses the optimal decision boundary to construct the regression formula and the fitted parameter set. The main idea is as follows:

(1) Construct the prediction function hθ(x), where

θ = (θ0, θ1, …, θn)

represents the parameter set for the eigenvalues x = (x0, x1, …, xn). The features x could have a linear or a non-linear relationship with the output, as we can see from Figure 6. Normally, we can represent the linear relationship using the formula

θ^T x = θ0 + θ1 x1 + … + θn xn,

and the non-linear relationship by adding higher-order terms, e.g.,

θ0 + θ1 x1 + θ2 x1^2 + … .

In linear regression, the idea of the cost function is to minimize the difference between the predicted result hθ(x) and the actual y; i.e.,

J(θ) = (1 / 2m) Σi (hθ(x^(i)) − y^(i))^2.

Then, in LR, we represent hθ(x) with the sigmoid function:

hθ(x) = 1 / (1 + e^(−θ^T x)).
Figure 6
The Linear and Non-linear Relationships between x and y
For details, see the Linear Predictive Model section in Materials and Methods.
(2) Use gradient descent to optimize J(θ). We fit the parameters using the gradient of the cost function (see Figure 6 and the Linear Predictive Model section). For LR, we can consider the cost function

J(θ) = −(1 / m) Σi [ y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ],

and we renew each parameter θj as

θj := θj − (α / m) Σi (hθ(x^(i)) − y^(i)) xj^(i),

where α is the learning rate.
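The update rule above can be sketched in plain Python. This is a minimal batch-gradient-descent illustration of the LR idea, not the LIBLINEAR solver the study actually used; the learning rate and epoch count are assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid: h_theta(x) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_lr(X, y, alpha=0.5, epochs=2000):
    """Fit LR by batch gradient descent:
    theta_j := theta_j - (alpha/m) * sum_i (h(x_i) - y_i) * x_ij.
    A bias feature x_0 = 1 is prepended to every sample."""
    m = len(X)
    data = [[1.0] + list(row) for row in X]
    theta = [0.0] * len(data[0])
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for xi, yi in zip(data, y):
            err = sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
            for j, v in enumerate(xi):
                grad[j] += err * v
        theta = [t - alpha * g / m for t, g in zip(theta, grad)]
    return theta

def predict(theta, x):
    """Classify with the fitted parameters at the 0.5 decision boundary."""
    z = sum(t * v for t, v in zip(theta, [1.0] + list(x)))
    return 1 if sigmoid(z) >= 0.5 else 0
```

On a toy one-dimensional dataset with negatives at 0 and 1 and positives at 2 and 3, the fitted boundary lands between the two groups, so `predict` separates them correctly.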
Author Contributions
Y.Z. conceived the project, designed the experiments, and edited the final version of the paper. H.L. performed the experiment. X.S. wrote the paper and drafted the figures. H.P. contributed to materials and data analysis.
Reference: Schwartz, S., Agarwala, S.D., Mumbach, M.R., Jovanovic, M., Mertins, P., Shishkin, A., Tabach, Y., Mikkelsen, T.S., Satija, R., Ruvkun, G., Carr, S.A., Lander, E.S., Fink, G.R., and Regev, A. (2013). Cell, November 21, 2013.