Rajiv G Govindaraj1, Sathiyamoorthy Subramaniyam1, Balachandran Manavalan1. 1. 1HotSpot Therapeutics, 50 Milk Street, 16 Floor, Boston, MA02109, USA; 2Research and Development Center, In-silicogen Inc., Yongin-si 16954, Gyeonggi-do, Republic of Korea; 3Department of Biotechnology, Dr. N.G.P. Arts and Science College, Coimbatore, Tamil Nadu641048, India; 4Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea.
Abstract
INTRODUCTION: N6-methyladenosine (m6A) is one of the most common post-transcriptional modifications in RNA, which has been related to several biological processes. The accurate prediction of m6A sites from RNA sequences is one of the challenging tasks in computational biology. Several computational methods utilizing machine-learning algorithms have been proposed that accelerate in silico screening of m6A sites, thereby drastically reducing the experimental time and labor costs involved. METHODOLOGY: In this study, we proposed a novel computational predictor termed ERT-m6Apred, for the accurate prediction of m6A sites. To identify the feature encodings with more discriminative capability, we applied a two-step feature selection technique on seven different feature encodings and identified the corresponding optimal feature set. RESULTS: Subsequently, performance comparison of the corresponding optimal feature set-based extremely randomized tree model revealed that Pseudo k-tuple composition encoding, which includes 14 physicochemical properties significantly outperformed other encodings. Moreover, ERT-m6Apred achieved an accuracy of 78.84% during cross-validation analysis, which is comparatively better than recently reported predictors. CONCLUSION: In summary, ERT-m6Apred predicts Saccharomyces cerevisiae m6A sites with higher accuracy, thus facilitating biological hypothesis generation and experimental validations.
INTRODUCTION: N6-methyladenosine (m6A) is one of the most common post-transcriptional modifications in RNA, which has been related to several biological processes. The accurate prediction of m6A sites from RNA sequences is one of the challenging tasks in computational biology. Several computational methods utilizing machine-learning algorithms have been proposed that accelerate in silico screening of m6A sites, thereby drastically reducing the experimental time and labor costs involved. METHODOLOGY: In this study, we proposed a novel computational predictor termed ERT-m6Apred, for the accurate prediction of m6A sites. To identify the feature encodings with more discriminative capability, we applied a two-step feature selection technique on seven different feature encodings and identified the corresponding optimal feature set. RESULTS: Subsequently, performance comparison of the corresponding optimal feature set-based extremely randomized tree model revealed that Pseudo k-tuple composition encoding, which includes 14 physicochemical properties significantly outperformed other encodings. Moreover, ERT-m6Apred achieved an accuracy of 78.84% during cross-validation analysis, which is comparatively better than recently reported predictors. CONCLUSION: In summary, ERT-m6Apred predicts Saccharomyces cerevisiae m6A sites with higher accuracy, thus facilitating biological hypothesis generation and experimental validations.
Post-transcriptional modifications in RNA are the variations that occur on a newly transcribed primary RNA transcript. To date, approximately 150 kinds of RNA modifications have been determined [1, 2]. The most abundant RNA modification is N6-methyladenosine (m6A), which is prevalent among viruses, plants, insects, mammals, and eukaryotes such as yeast [3-7]. m6A denotes the methylation at N-6 position of adenosine nucleotide catalyzed by a methyltransferase complex and this reaction is reversible by demethylases (ALKBH5 and FTO). m6A modification has been involved in a series of biological processes, such as mRNA exporting, nascent mRNA synthesis, splicing events, nuclear translation, and translocation [8-10]. Importantly, unusual modifications of m6A have been associated with several diseases, including prostate cancer, thyroid tumor, leukemia, etc. [11-13]. Therefore, accurate identification of m6A modification sites would be of great benefit for cell biologists to better understand the disease mechanism.Several experimental approaches, including high performance liquid chromatography [14], next-generation sequencing technologies [15, 16], and two-dimensional thin layer chromatography [17] have been widely applied in the identification of m6A sites. Particularly, next-generation sequencing is not available for large-scale genomic sequences’ m6A identification. Overall, these experimental approaches are time-consuming and cost-ineffective, when applied on large-scale genome analysis. Therefore, the development of an accurate and efficient computational method for m6A identification is necessary to complement experimental approaches.Previous decade has witnessed tremendous growth in the development of various machine-learning (ML)-based methods to predict m6A sites from RNA sequences in different species, such as Homo sapiens, Saccharomyces cerevisiae, Mus musculus, and Arabidopsis thaliana. In this study, we focused on S. cerevisiae because it has been widely recognized as an attractive model organism. To date, 14 prediction models have been developed for this species to predict m6A sites. Chen et al., [18] proposed the first predictor, where they constructed a reliable benchmark dataset of 1307 positive samples (m6A sites) and an equal number of negative samples (non-m6A sites) for S. cerevisiae based on the experimental data [19]. They developed a predictor called “iRNA-Methyl” by employing SVM and pseudo-nucleotide composition features, that achieved an accuracy of 65.59%. Notably, iRNA-Methyl benchmark dataset acted as a base for the development of other methods, including m6Apred [20], pRNAm-PC [21], RNA-MethylPred [22], TargetM6A [23], M6A-HPCS [24], RAM-ESVM [25], RAM-NPPS [26], M6APred-EL [27], iRNA(m6A)-PseDNC [28], iMethyl-STTNC [29], BERMP [30], M6AMRFS [31], and iRNA-Freq [32]. Researchers employed different approaches and feature encoding schemes to improve the predictive performance. A detailed description of the performance of each method and their approaches have been provided in the recent review [33]. Although substantial growth has been made, it still remains a challenging task to extract appropriately useful features to accurately differentiate m6A sites from non-m6A sites.In the current study, we proposed a novel computational approach called ERT-m6Apred. Firstly, 14 physicochemical properties (PCPs) were incorporated into Pseudo k-tupler composition (PseKNC) and were considered as the features to differentiate m6A from non-m6A samples. To optimize the feature space, we combined F-score and sequential forward search (SFS) using extremely randomized tree (ERT) for enhancing the ability of the feature representation. Our benchmark result shows that the proposed method ERT-m6Apred can achieve improved performance when compared to the existing methods.
MATERIALS AND METHODS
Dataset
Chen et al., [18] constructed a benchmark dataset containing 2614 sequences from S. cerevisiae. Of those, 1307 sequences are positive samples (m6A sites) and 1307 sequences are negative samples (non-m6A sites). Notably, both negative and positive samples are 51-base pair (bp) long with the adenosine in the center position. Furthermore, sequence similarity of this dataset is lower than 85%. This dataset has been widely used in the development of various prediction models [8, 21, 34]. For fair comparison with the existing methods, we also employed the same benchmark dataset for model development. The benchmark dataset can be downloaded from previous work: http://server.malab.cn/M6APred-EL/.
Feature Encoding Scheme
Pseudo k-tupler Composition (PseKNC)
In this work, we utilized Type-II PseKNC to represent RNA samples, which has been widely applied in previous methods [34-36]. This encoding can efficiently capture short-range and long-range information of RNA sequences. For a given sequence, it is denoted by:where,where denotes the counted rank of the correlations along an RNA sequence, is the normalized frequency of the uth
k-tuple nucleotide in RNA segment, is the weight factor, and is defined as:The correlation function is defined as:where = 14 physicochemical properties (PCPs), which includes 'Tilt', 'Roll', 'Rise', 'Shift', 'Slide', 'Twist', 'GC content', 'Adenine content', Uracil content', 'Stacking energy', 'Bending stiffness', 'Electron_interaction', 'Enthalpy', and 'Entropy.' We note that the above PCPs are based on DNA, which were modified accordingly to RNA sequence (replacing one base pair from Uracil to Thymine). L is the sequence length. is the numerical value of the vth physicochemical index of dinucleotide at position a and denotes the corresponding value of the dinucleotide at position a+b. Based on our preliminary analysis of these three parameters (), we set =1.0, = 0.8, and k = 6 to calculate the Type-II PseKNC that generates 4098-dimensional feature vector.
Feature Optimization
In general, the feature vectors with higher dimensions (>100) have noisy and irrelevant information [37, 38]. Therefore, it is necessary to exclude that information and improve classification accuracy, which is regarded as one of the key steps in ML-based model development. A two-step feature selection scheme was applied that has been widely used in computational biology [39-42]. Firstly, ranking the original features by F-score algorithm and generate a ranked feature list (from highest to the lowest ranked features) [43]. Secondly, two features were selected at each time from the ranked list, and added consecutively to ERT and developed their corresponding models by 10-fold cross-validation (CV). Eventually, the feature set equivalent to the model with the highest accuracy was considered as optimal features.The F-score of the kth feature is defined as:where , , and denote an average of the kth feature in the combined (both negative and positive), negative, and positive datasets, respectively. and represent the number of negative and positive samples, respectively. and represent the kth feature of lth negative instance and kth feature of lth positive instance, respectively.
Extremely Randomized Tree
Guertz et al. proposed ERT method in 2006 [44], which has been widely applied in various fields [45-51]. A brief description of how we implemented ERT in this study has been described previously [52, 53]. Grid search approach was implemented to optimize three ERT parameters, which include the number of randomly selected features (mtry), number of trees (ntree), and minimum number of samples required to split an internal node (nsplit). The three parameters search ranges were: 312 with a step size of 1, 401000 with a step size of 20, and 415 with a step size of 1.
Cross-validation
We carried out 10-fold CV test to assess model performance. In 10-fold CV, the benchmark dataset was randomly separated into 10 subgroups with roughly equal size [54-64], with each subgroup containing the same number of m6A and non-m6A samples [65, 66]. One of the subgroups was considered as the validation set to assess the trained model and the remaining subgroups were used to train the model. This step was repeated 10 times by considering each one of the subgroups used at least once as a validation [67-69]. The performance of 10 validation sets was averaged and provided an assessment of the global performance.
Performance Assessment
The commonly used four sets of metrics in the fields of computational biology and bioinformatics [57, 70-77] were utilized to quantitatively evaluate the performance of the proposed method. These metrics included Matthews correlation coefficient (MCC), accuracy (ACC), sensitivity (SN), and specificity (SP), which were computed as follows:Where TP and TN respectively represent the number of m6A samples and the number of non-m6A samples correctly predicted. FP and FN represent the numbers of m6A or non-m6A samples wrongly predicted, respectively.
RESULTS AND DISCUSSION
Exploration of Different Feature Encodings in m6A Site Prediction
To identify the appropriate feature encoding in m6A site prediction seven feature encodings were explored, including PseKNC, k-mer (concatenation of mono-, di, tri-, tetra-, and penta-nucleotide composition), binary profile (BPF), electron-ion interaction pseudo potential (PseEIIP), numerical representation of nucleotides (NUM), ring-function-hydrogen-chemical (RFHC) properties, and a combination of dinucleotide binary encoding, and local position-specific dinucleotide frequency (DPE_LPF). A detailed description of the other six feature encodings (except PseKNC) has been provided in previous studies [68, 78, 79] and we modified according to RNA sequence. Seven prediction models were developed and the performances are shown in Fig. (). As shown in Fig. (), we observed that four encodings (BPF, RFHC, DPE_LPF, NUM) achieved a similar performance, whose ACC in the range of 73.8-75.5%; (ii) the next two encodings (Kmer and PseKNC) achieved the same performance (ACC of 71.8%), which is significantly lower than the above four encodings, and EIIP achieved the worst performance. Finally, none of the featured encodings achieved significant performance over others. It should be noted that all these encodings (except NUM and EIIP) have greater than 100-dimensional feature vector. Hence, we applied SFS to see whether any encoding has an advantage over others in terms of performance.
Optimal Feature Selection of Each Encoding and their Performance Comparison
Fig. () shows the ACC curves with gradual increment of features for seven encodings. The result shows that allencodings follow the same pattern, where ACC curve gradually improved, reached its highest point and then deteriorated upon the increment of features. Remarkably, two-step feature selection procedure significantly improved the prediction performance in two encodings (Kmer and PseKNC), slight improvement in four encodings (BPF, DPE_LPF, RFHC and NUM), and no improvement in EIIP encoding when compared to their control (using all features). Next, we examined the highest ACC achieved by each encoding corresponding optimal features. Result shows that Kmer, PseKNC, BPF, RFHC, LPDF, and NUM encodings respectively considered 26.60%, 18.7%, 82.36% 50.98%, 70.0%, and 47.1% as the optimal features when compared to their original dimension (Fig. ). It appears that all features are equally important in case of EIIP encoding, because of its best performance with all features.Next, we compared the optimal feature-based performances. Fig. () shows that PseKNC achieved the ACC of 78.84%, which is 3.17-10.74% higher than other encodings, thus indicating that PseKNC has more discriminative capability to identify m6A sites. Hence, we selected PseKNC-based model namely, ERT-m6Apred, as the final one.
Comparison with the Existing Methods
To see whether our proposed method is better than the existing predictors, we compared our method with the four recent methods, namely M6AMRFS, BERMP, iMethyl-STTNC and iRNA-Freq. Notably, all these methods were trained and validated on the same benchmark dataset as employed in this study, indicating a fair comparison. As shown in Table , our method ERT-m6Apred achieved an ACC of 78.84% and MCC of 0.578. Explicitly, ACC and MCC were 1.7-9.0% and 2.8-20% higher than the existing methods, thus indicating that our predictor ERT-m6Apred is more accurate and balanced than the existing predictors in m6A site prediction.
CONCLUSION
In this study, we reported a novel predictor for m6A site prediction from RNA sequence. To the best of our knowledge, this is the first instance where ERT and PseKNC were employed in m6A site prediction. Comparative analysis of CV results shows that the proposed predictor is better than existing predictors. Although our proposed method showed improved performance over other methods, it still has room for improvement. Recently, several novel computational approaches have been proposed in computational biology [68, 78, 80-85] to identify function from the sequence. Hence, developing a novel prediction model by utilizing such approaches may improve the prediction performance.
Table 1
Performance comparison of the proposed predictor with the existing methods.
Methods
MCC
ACC
SN
SP
M6A-ERT
0.578
0.788
0.822
0.755
M6AMRFS
0.485
0.743
0.752
0.733
BERMP
0.430
0.713
0.730
0.696
iMethyl-STTNC
0.380
0.698
0.703
0.682
iRNA-Freq
0.550
0.771
0.836
0.719
First column represents the method name. The second, third, fourth, and fifth respectively represents Matthews Correlation Coefficient (MCC), Accuracy (ACC), Sensitivity (SN) and Specificity (SP).