
iRNA-m7G: Identifying N⁷-methylguanosine Sites by Fusing Multiple Features.

Wei Chen¹, Pengmian Feng², Xiaoming Song³, Hao Lv⁴, Hao Lin⁵.

Abstract

As an essential post-transcriptional modification, N7-methylguanosine (m7G) regulates nearly every step of the life cycle of mRNA. Accurate identification of the m7G site in the transcriptome will provide insights into its biological functions and mechanisms. Although the m7G-methylated RNA immunoprecipitation sequencing (MeRIP-seq) method has been proposed in this regard, it is still cost-ineffective for detecting the m7G site. Therefore, it is urgent to develop new methods to identify the m7G site. In this work, we developed the first computational predictor called iRNA-m7G to identify m7G sites in the human transcriptome. The feature fusion strategy was used to integrate both sequence- and structure-based features. In the jackknife test, iRNA-m7G obtained an accuracy of 89.88%. The superiority of iRNA-m7G for identifying m7G sites was also demonstrated by comparing with other methods. We hope that iRNA-m7G can become a useful tool to identify m7G sites. A user-friendly web server for iRNA-m7G is freely accessible at http://lin-group.cn/server/iRNA-m7G/.


Keywords: N(7)-methylguanosine; RNA secondary structure; feature fusion; nucleotide chemical property; pseudo nucleotide composition

Year: 2019 PMID: 31581051 PMCID: PMC6796804 DOI: 10.1016/j.omtn.2019.08.022

Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886

Introduction

Besides N1-methyladenosine (m1A), N7-methylguanosine (m7G) is another kind of positively charged RNA modification. m7G is added co-transcriptionally to the 5′ end of mRNA, and it is essential for efficient gene expression and cell viability. m7G is required for nearly all phases of the mRNA life cycle, including splicing, polyadenylation, nuclear export, and translation. Although m7G has been studied for a long time, knowledge about its function is still limited. The key step in revealing the functions of m7G is to determine its exact positions in the transcriptome. By using mass spectrometry quantification and the m7G-methylated RNA immunoprecipitation sequencing (MeRIP-seq) method, Zhang et al. not only detected m7G sites in Homo sapiens and Mus musculus but also provided base-resolution m7G sites in human HeLa and HepG2 cells. However, the MeRIP-seq method has its own limitations, and it is cost-ineffective for transcriptome-wide detection. Therefore, it is necessary to develop computational methods for identifying m7G sites. To the best of our knowledge, no computational method is available for this aim. Inspired by the wide application of machine-learning methods for identifying RNA modification sites [8, 9], in this study we developed a support vector machine (SVM)-based method, called iRNA-m7G, to identify m7G sites. To extract informative features for encoding the RNA sequence, a feature fusion strategy was used to integrate three kinds of features: nucleotide property and frequency, pseudo nucleotide composition, and secondary structure component. Experiments showed that the feature fusion strategy is superior to any single kind of feature for identifying m7G sites. Moreover, a user-friendly web server for iRNA-m7G has been provided at http://lin-group.cn/server/iRNA-m7G/. We expect that the proposed predictor will speed up the detection of m7G sites.

Results and Discussion

Performance of Each Kind of Feature

We built three models based on the three kinds of features (nucleotide property and frequency [NPF], pseudo nucleotide composition [PseDNC], and secondary structure component [SSC]), and we compared their performances for identifying m7G sites. As indicated in Equations 4 and 5, the PseDNC model is dependent on two parameters, w and λ. Hence, we first optimized the parameters of PseDNC. In general, the greater the λ value is, the more global sequence-order information the model contains. However, a larger λ would reduce the cluster-tolerant capacity so as to lower the cross-validation accuracy due to an overfitting problem. Therefore, the search ranges for w and λ were set in [0, 1] and [1, 10] with a step of 0.1 and 1, respectively. As shown in Figure 1, the PseDNC-based model yielded the best results when w = 0.8 and λ = 8.
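The parameter sweep over w and λ can be sketched as below. This is a minimal illustration rather than the authors' code: `score_fn` is a hypothetical callback standing in for the cross-validated accuracy of a PseDNC-based model, and `toy_score` is an invented stand-in that exists only so the example runs.

```python
from itertools import product

def pse_dnc_grid():
    """Candidate (w, lam) pairs: w in [0, 1] with step 0.1, lam in [1, 10] with step 1."""
    weights = [round(0.1 * k, 1) for k in range(11)]   # 0.0, 0.1, ..., 1.0
    lambdas = list(range(1, 11))                        # 1, 2, ..., 10
    return list(product(weights, lambdas))

def grid_search(score_fn):
    """Return the (w, lam) pair with the highest score under score_fn."""
    return max(pse_dnc_grid(), key=lambda pair: score_fn(*pair))

# Hypothetical scorer for demonstration only (not real results);
# in practice this would train and cross-validate a model for each pair.
def toy_score(w, lam):
    return -((w - 0.8) ** 2) - ((lam - 8) / 10) ** 2

best = grid_search(toy_score)
```

With these ranges the grid contains 11 × 10 = 110 candidate pairs, so an exhaustive search stays cheap even when each evaluation is a full cross-validation run.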

Figure 1

Determining the Optimal Values for the Two Parameters w and λ of PseDNC

The k-fold cross-validation test is often used to examine the quality of predictors. To save computational time, the 10-fold cross-validation test was used in the current study to evaluate the performance of these models. Their predictive results are reported in Table 1. Among the three models, the NPF-based model obtained the highest accuracy of 89.14%, which is 4.19 and 13.23 percentage points higher than that of the PseDNC- and SSC-based models, respectively, for identifying m7G sites in the dataset.

Table 1

Predictive Results for Identifying m7G Sites by Using Different Features

Features	Sn (%)	Sp (%)	Acc (%)	MCC	auROC
NPF	88.12	90.15	89.14	0.78	0.899
PseDNC	81.92	87.99	84.95	0.70	0.841
SSC	73.11	78.71	75.91	0.52	0.776
Fusion	88.66	90.96	89.81	0.80	0.946

Sn, sensitivity; Sp, specificity; Acc, accuracy; MCC, Matthews correlation coefficient; auROC, area under the receiver operating characteristic curve; NPF, nucleotide property and frequency; PseDNC, pseudo nucleotide composition; SSC, secondary structure component.

To objectively compare their performances, the area under the receiver operating characteristic curve (auROC) of these methods was also calculated. The NPF-based model obtained an auROC of 0.899, higher than the 0.841 and 0.776 obtained by the PseDNC- and SSC-based models, respectively.

Performance of Fusing Multiple Features

To investigate whether the feature fusion strategy could improve the performance, we built another model by fusing the NPF, PseDNC, and SSC features. The framework for building the model is shown in Figure 2. The model thus obtained was then evaluated by using the 10-fold cross-validation test. The detailed results are provided in the last row of Table 1. As indicated in Table 1, the sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC) were all improved compared with those obtained by the NPF-, PseDNC-, and SSC-based models.

Figure 2

Framework of Developing iRNA-m7G

For an RNA sequence, it is converted into a feature vector by fusing nucleotide property and frequency, pseudo nucleotide composition, and secondary structure component. The support vector machine was used to build the classification model.

To intuitively compare the performance of the models based on different features, their ROC curves from the 10-fold cross-validation test were plotted in Figure 3. The fusion strategy-based model obtained an auROC of 0.946, which is higher than those of the NPF-, PseDNC-, and SSC-based models.

Figure 3

The Receiver Operating Characteristic Curves of the Models Based on Different Features Identifying m7G sites

SSC is the abbreviation for secondary structure component, NPF is for nucleotide property and frequency, PseDNC is for pseudo nucleotide composition, and fusion is the combination of the abovementioned three kinds of features. The auROC values were provided in brackets.

Moreover, to further demonstrate its stability for identifying m7G sites, the fusion strategy-based model was also evaluated by the jackknife test, in which each sample in the training dataset is in turn singled out as an independent test sample, and all the properties are calculated without including the one being identified. In the jackknife test, the fusion strategy-based model obtained an accuracy of 89.88% with a sensitivity of 89.07%, specificity of 90.69%, and MCC of 0.80, which are comparable to those from the 10-fold cross-validation test. These results indicate that the feature fusion strategy is effective and the model is robust for identifying m7G sites.
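The relation between the two validation schemes can be made concrete with a small index splitter: the jackknife test is the special case of k-fold cross-validation in which k equals the number of samples. The function below is a sketch under that reading; its name and the contiguous, unshuffled fold layout are illustrative choices, not details taken from the paper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds.

    Setting k == n_samples gives the jackknife (leave-one-out) test.
    Samples are assumed to have been shuffled beforehand.
    """
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

ten_fold = list(k_fold_indices(1482, 10))    # 741 positive + 741 negative samples
jackknife = list(k_fold_indices(5, 5))       # tiny example: one test sample per round
```

For the 1,482-sample benchmark, the jackknife test requires 1,482 training runs instead of 10, which is why the 10-fold test is the cheaper choice for model comparison.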

Comparison of SVM and Other Classifiers

Since no computational method has previously been proposed for identifying m7G sites, we demonstrated the effectiveness of the current SVM-based model by comparing its performance with those of Naive Bayes-, Random Forest-, LogitBoost-, and BayesNet-based models. Naive Bayes, Random Forest, LogitBoost, and BayesNet were implemented by using WEKA. For a fair comparison, all the models were built by using the feature fusion strategy and tested on the same dataset. The 10-fold cross-validation test results of these models are reported in Table 2. As shown in Table 2, the SVM-based model obtained the best results in terms of the four metrics defined in Equation 9. The predictive accuracy of the SVM-based model is 9.65, 3.24, 6.00, and 7.69 percentage points higher than those of the Naive Bayes-, Random Forest-, LogitBoost-, and BayesNet-based models, respectively. This result demonstrates that the SVM is more effective than the other classification algorithms tested for identifying m7G sites.

Table 2

Performance Comparison of Different Classifiers for Identifying m7G Sites by the 10-Fold Cross-Validation Test

Classifiers	Sn (%)	Sp (%)	Acc (%)	MCC
Naive Bayes	72.47	87.85	80.16	0.61
Random Forest	83.27	89.88	86.57	0.73
LogitBoost	81.38	86.23	83.81	0.68
BayesNet	77.19	87.04	82.12	0.65
SVM	88.66	90.96	89.81	0.80

Sn, sensitivity; Sp, specificity; Acc, accuracy; MCC, Matthews correlation coefficient; SVM, support vector machine.


Conclusions

In this study, we proposed iRNA-m7G, the first computational method to identify m7G sites. In this predictor, the feature fusion strategy was used to represent RNA sequences. Comparative results demonstrated that the feature fusion strategy is more effective for identifying m7G sites than any single kind of feature. Moreover, we compared the SVM with four other machine-learning algorithms and found that the SVM-based model achieves the best performance for identifying m7G sites. For the convenience of the scientific community, a publicly accessible web server called iRNA-m7G that allows the prediction of m7G sites in RNA was established at http://lin-group.cn/server/iRNA-m7G/. We anticipate that iRNA-m7G will become a useful tool for identifying m7G sites. In future work, we will collect more m7G data and use powerful methods such as deep learning [12, 13, 14, 15] to improve the performance of computationally identifying m7G sites.

Materials and Methods

Benchmark Datasets

By using the MeRIP-seq method, Zhang et al. detected 801 base-resolution m7G sites in human HeLa and HepG2 cells. By mapping these sites to the human genome (hg19), 801 m7G site-containing sequences were obtained. Preliminary tests indicated that the best predictive result was achieved when the sequence length is 41 bp with the m7G site in the center. To build a high-quality dataset, the CD-HIT software with a threshold of 80% was used to remove redundant sequences [16, 17]. Accordingly, we obtained 741 m7G site-containing sequences. The non-m7G site-containing sequences were obtained by choosing 41-bp-long sequences whose central guanosine was not detected as m7G by the MeRIP-seq method. This procedure yields a huge number of negative samples. Since imbalanced datasets affect the performance evaluation of computational methods, we balanced the numbers of positive and negative samples for model training by randomly picking 741 non-m7G site-containing sequences with sequence similarity of less than 80% to form the negative set.
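The windowing and balancing steps can be illustrated with a short sketch. It assumes sequences are plain strings over A, C, G, U; the function names, the toy transcript, and the fixed random seed are all hypothetical, and the CD-HIT redundancy filter is not reproduced here.

```python
import random

def centered_windows(seq, center="G", flank=20):
    """Yield (position, window) for each `center` base that has `flank` bases on both sides."""
    for i, base in enumerate(seq):
        if base == center and i - flank >= 0 and i + flank < len(seq):
            yield i, seq[i - flank : i + flank + 1]   # 2 * flank + 1 = 41 bases

def balanced_negatives(candidates, n_positive, seed=0):
    """Randomly draw as many negative windows as there are positive ones."""
    rng = random.Random(seed)                          # seeded for reproducibility
    return rng.sample(candidates, n_positive)

transcript = "AUGC" * 15                               # toy 60-base sequence, not real data
windows = list(centered_windows(transcript))
negatives = balanced_negatives([w for _, w in windows], 3)
```

In the toy transcript only guanosines at positions 22 to 38 have 20 flanking bases on each side, so five windows are produced; guanosines too close to either end are skipped rather than padded.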

Sequence Representation

NPF

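The NPF is an effective sequence-encoding scheme for computationally identifying nucleotide modification sites [18, 19, 20, 21]. According to NPF, the i-th nucleotide $n_i$ in an RNA sequence can be represented by a four-dimensional vector $(x_i, y_i, z_i, d_i)$, in which the elements are defined as follows:

$$
x_i = \begin{cases} 1, & n_i \in \{A, G\} \\ 0, & n_i \in \{C, U\} \end{cases}, \quad
y_i = \begin{cases} 1, & n_i \in \{A, U\} \\ 0, & n_i \in \{C, G\} \end{cases}, \quad
z_i = \begin{cases} 1, & n_i \in \{A, C\} \\ 0, & n_i \in \{G, U\} \end{cases}
\tag{1}
$$

where the x, y, and z coordinates stand for the ring structure, hydrogen bond, and chemical functionality, respectively; $d_i$ is the accumulated frequency and is defined as

$$
d_i = \frac{1}{|N_i|} \sum_{j=1}^{i} f(n_j), \quad
f(n_j) = \begin{cases} 1, & n_j = n_i \\ 0, & \text{otherwise} \end{cases}, \quad i = 1, 2, \ldots, l
\tag{2}
$$

where l is the sequence length, and $|N_i|$ is the length of the i-th prefix string $\{n_1, n_2, \ldots, n_i\}$ in the sequence. According to NPF, an RNA sequence with a length of l bp will be encoded by the following 4l-dimensional vector:

$$
[x_1, y_1, z_1, d_1, x_2, y_2, z_2, d_2, \ldots, x_l, y_l, z_l, d_l]^{\mathrm{T}}
\tag{3}
$$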
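A minimal sketch of the NPF encoding follows. The 0/1 assignment in `CHEM` is an assumption based on the commonly used convention (purine, weak hydrogen bonding, and amino group each coded as 1); the function name is illustrative and this is not the authors' implementation.

```python
# Assumed (ring structure, hydrogen bond, chemical functionality) codes:
# purines A/G -> x = 1; weak hydrogen bonding A/U -> y = 1; amino group A/C -> z = 1.
CHEM = {"A": (1, 1, 1), "C": (0, 0, 1), "G": (1, 0, 0), "U": (0, 1, 0)}

def npf_encode(seq):
    """Encode an RNA sequence as a flat list of 4 * len(seq) values."""
    counts = {}
    features = []
    for i, n in enumerate(seq, start=1):
        counts[n] = counts.get(n, 0) + 1
        d = counts[n] / i            # frequency of n within the prefix of length i
        features.extend([*CHEM[n], d])
    return features

vec = npf_encode("GAG")
```

For a 41-base window this yields a 164-dimensional vector. Because the accumulated frequency depends on the prefix, two identical nucleotides at different positions generally receive different fourth components, which is how the encoding carries positional information.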

PseDNC

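Besides the local sequence order information, the global sequence order effect is also important for computationally identifying RNA modification sites. Accordingly, in the current study, the PseDNC was also used to encode the RNA sequences, which can be calculated by using PseKNC and PseKNC-General. Based on PseDNC, the RNA sequence is converted into a discrete vector defined as follows:

$$
\mathbf{D} = [d_1, d_2, \ldots, d_{16}, d_{16+1}, \ldots, d_{16+\lambda}]^{\mathrm{T}}
\tag{4}
$$

with

$$
d_u = \begin{cases}
\dfrac{f_u}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 16 \\[2ex]
\dfrac{w\,\theta_{u-16}}{\sum_{i=1}^{16} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 16 < u \le 16 + \lambda
\end{cases}
\tag{5}
$$

where $f_u$ is the occurrence frequency of the u-th non-overlapping dinucleotide in the RNA sequence, w is the weight factor, and

$$
\theta_j = \frac{1}{l - j - 1} \sum_{i=1}^{l-j-1} C_{i,i+j}, \quad j = 1, 2, \ldots, \lambda; \ \lambda < l
\tag{6}
$$

where l is the sequence length and $\theta_j$ is the j-tier correlation factor that reflects the sequence order correlation between all the j-th most contiguous dinucleotides. $C_{i,i+j}$ is defined as

$$
C_{i,i+j} = \frac{1}{\mu} \sum_{g=1}^{\mu} \left[ P_g(R_i R_{i+1}) - P_g(R_{i+j} R_{i+j+1}) \right]^2
\tag{7}
$$

where μ is the number of RNA physicochemical properties considered, $P_g(R_i R_{i+1})$ is the normalized numerical value of the g-th (g = 1, 2, 3, …, μ) RNA local structural property for the dinucleotide $R_i R_{i+1}$ at position i, and $P_g(R_{i+j} R_{i+j+1})$ is the corresponding value for the dinucleotide $R_{i+j} R_{i+j+1}$ at position i + j. In the current work, the enthalpy, entropy, and free energy, which have been used to identify other kinds of RNA modifications, were used to define PseDNC. The values for the three physicochemical properties of the 16 different RNA dinucleotides were obtained from previous works [25, 26]. Thus, μ in Equation 7 is equal to 3.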
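The PseDNC construction can be sketched in a few lines. Two points are assumptions: the property table `TOY_PROPS` holds invented placeholder numbers (the real enthalpy, entropy, and free energy values come from the cited works and are not reproduced here), and dinucleotide frequencies are counted over overlapping positions, as is usual for pseudo k-tuple compositions.

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGU", repeat=2)]   # 16 dinucleotides

# Placeholder property table (3 values per dinucleotide). These numbers are
# hypothetical and only make the sketch runnable; they are NOT measured values.
TOY_PROPS = {
    d: ((k % 4) / 3.0, (k // 4) / 3.0, ((k * 7) % 16) / 15.0)
    for k, d in enumerate(DINUCS)
}

def pse_dnc(seq, props, w=0.8, lam=8):
    """Return the (16 + lam)-dimensional PseDNC vector; requires lam < len(seq) - 1."""
    length = len(seq)
    pairs = [seq[i:i + 2] for i in range(length - 1)]       # overlapping dinucleotides
    freq = [pairs.count(d) / len(pairs) for d in DINUCS]
    mu = len(next(iter(props.values())))                    # number of properties
    theta = []
    for j in range(1, lam + 1):
        # Mean squared property difference between dinucleotides j positions apart.
        terms = [
            sum((props[pairs[i]][g] - props[pairs[i + j]][g]) ** 2 for g in range(mu)) / mu
            for i in range(length - j - 1)
        ]
        theta.append(sum(terms) / len(terms))
    denom = sum(freq) + w * sum(theta)
    return [f / denom for f in freq] + [w * t / denom for t in theta]

vec = pse_dnc("GGAUCCGAUGGCAUCGAUCGG", TOY_PROPS)
```

With the reported optimum of w = 0.8 and λ = 8, the vector has 16 + 8 = 24 components that sum to 1, so the weight w directly controls how much of the vector's mass is given to the sequence-order terms.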

SSC

The formation of RNA modification is affected by RNA structures. Hence, the RNAfold tool in the ViennaRNA package was used to predict the secondary structure of the RNA sequences in the dataset. For each position in the RNA, a paired nucleotide was represented by a parenthesis ("(" or ")"), while an unpaired one was represented by a dot ("."). In the current study, we do not distinguish "(" from ")" and use "(" for both statuses. For a given tri-nucleotide, there are eight ($2^3$) possible structure statuses: "(((", "((.", "(..", "(.(", ".((", ".(.", "..(", and "...". Together with the first nucleotide of the tri-nucleotide, there are 32 (4 × 8) possible sequence-structure modes, denoted as "A-(((", "A-((.", "A-(..", …, and "U-...". Therefore, by using the sequence-structure mode, an RNA sequence can be represented by a 32-dimensional feature vector with one component per sequence-structure mode.
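The sequence-structure modes can be tallied from a dot-bracket string as sketched below. The sketch assumes the structure string has already been predicted (RNAfold is not invoked), slides over overlapping tri-nucleotides, and normalizes counts to frequencies; the last two choices are assumptions, as are the function name and the toy hairpin.

```python
from itertools import product

# 4 first nucleotides x 8 three-position pairing patterns = 32 modes
MODES = [f"{n}-{''.join(s)}" for n in "ACGU" for s in product("(.", repeat=3)]

def ssc_encode(seq, structure):
    """Frequency of each sequence-structure mode over all overlapping tri-nucleotides."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    flat = structure.replace(")", "(")       # both bracket types mean 'paired'
    counts = dict.fromkeys(MODES, 0)
    n_triplets = len(seq) - 2
    for i in range(n_triplets):
        counts[f"{seq[i]}-{flat[i:i + 3]}"] += 1
    return [counts[m] / n_triplets for m in MODES]

# Toy hairpin in dot-bracket notation; not a predicted structure from the dataset.
vec = ssc_encode("GGGAAACCC", "(((...)))")
```

Collapsing ")" into "(" halves the number of structural symbols, which keeps the feature space at 32 dimensions instead of 4 × 27 = 108.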

SVM

In the current study, the LibSVM package 3.18, which is available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/, was used to perform the classification task. The basic idea of SVM is to transform the input data into a high-dimensional feature space and then determine the optimal separating hyperplane. Because of its better performance, the radial basis function (RBF) kernel was used to obtain the separating hyperplane. The regularization parameter C and kernel parameter γ of the SVM were optimized in the ranges of $[2^{-5}, 2^{15}]$ and $[2^{-15}, 2^{-5}]$ with steps of $2$ and $2^{-1}$, respectively. The final prediction was made according to the probability obtained by SVM [29, 30, 31, 32, 33]. If its probability is >0.5, a guanine will be predicted as an m7G site.
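The kernel, the search grid, and the decision rule can be written out explicitly. This sketch does not call LibSVM; it only spells out the quantities involved. Reading the steps of 2 and 2^-1 as multiplicative factors applied per grid point is an assumption, and the function names are illustrative.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Assumed grids: C doubles from 2^-5 up to 2^15, gamma halves from 2^-5 down to 2^-15.
C_GRID = [2.0 ** e for e in range(-5, 16)]
GAMMA_GRID = [2.0 ** e for e in range(-5, -16, -1)]

def call_site(probability, threshold=0.5):
    """Predict an m7G site when the SVM probability is strictly above the threshold."""
    return probability > threshold
```

Under this reading the search covers 21 × 11 = 231 (C, γ) pairs. A probability of exactly 0.5 is not called as a site, matching the strict inequality in the text.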

Evaluation Metrics

In this study, the four metrics [34, 35, 36, 37, 38, 39, 40], namely, Sn, Sp, Acc, and MCC, were used to measure the performance of the proposed methods, which are defined as follows:

$$
\begin{cases}
Sn = 1 - \dfrac{N_-^+}{N^+} \\[2ex]
Sp = 1 - \dfrac{N_+^-}{N^-} \\[2ex]
Acc = 1 - \dfrac{N_-^+ + N_+^-}{N^+ + N^-} \\[2ex]
MCC = \dfrac{1 - \left( \dfrac{N_-^+}{N^+} + \dfrac{N_+^-}{N^-} \right)}{\sqrt{\left( 1 + \dfrac{N_+^- - N_-^+}{N^+} \right) \left( 1 + \dfrac{N_-^+ - N_+^-}{N^-} \right)}}
\end{cases}
\tag{9}
$$

where $N^+$ represents the total number of m7G site-containing sequences, while $N_-^+$ is the number of m7G site-containing sequences incorrectly predicted to be non-m7G site-containing sequences; $N^-$ is the total number of non-m7G site-containing sequences, while $N_+^-$ is the number of non-m7G site-containing sequences incorrectly predicted to be m7G site-containing sequences. Moreover, by plotting the sensitivity against (1 − specificity) as the threshold varies, the ROC curve [41, 42] was generated to evaluate the performance of the proposed method. The auROC is an indicator of the performance of the method. An auROC value of 0.5 is equivalent to random prediction, while an auROC of 1 represents a perfect prediction.
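As a worked check of the four metrics, the sketch below computes them from error counts. The counts used in the example (84 missed positives and 67 false positives out of 741 samples per class) are not reported in the paper; they are back-calculated here from the fusion-model percentages in Table 1 and should be read as an assumption.

```python
import math

def chou_metrics(n_pos, n_neg, fn, fp):
    """Return Sn, Sp, Acc (as percentages) and MCC.

    n_pos / n_neg: total numbers of m7G / non-m7G samples.
    fn: m7G samples predicted as non-m7G; fp: non-m7G samples predicted as m7G.
    """
    sn = 1 - fn / n_pos
    sp = 1 - fp / n_neg
    acc = 1 - (fn + fp) / (n_pos + n_neg)
    mcc = (1 - (fn / n_pos + fp / n_neg)) / math.sqrt(
        (1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg)
    )
    return 100 * sn, 100 * sp, 100 * acc, mcc

# Assumed error counts inferred from the reported percentages, not taken from the paper.
sn, sp, acc, mcc = chou_metrics(n_pos=741, n_neg=741, fn=84, fp=67)
```

Rounded to two decimals these inputs give Sn = 88.66, Sp = 90.96, Acc = 89.81, and MCC = 0.80, which agrees with the fusion row of Table 1. This form of MCC is algebraically equivalent to the usual confusion-matrix expression.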

Author Contributions

W.C. and H. Lin conceived and designed the study. W.C., P.F., X.S., and H. Lin conducted the experiments. P.F., W.C., and X.S. implemented the algorithms. H. Lv established the web server. W.C., P.F., X.S., H. Lv, and H. Lin performed the analysis and wrote the paper. All authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no competing interests.

39 in total

1. Identification of hormone binding proteins based on machine learning methods.

Authors: Jiu Xin Tan; Shi Hao Li; Zi Mei Zhang; Cui Xia Chen; Wei Chen; Hua Tang; Hao Lin
Journal: Math Biosci Eng Date: 2019-03-22 Impact factor: 2.080

2. Identify origin of replication in Saccharomyces cerevisiae using two-step feature selection technique.

Authors: Fu-Ying Dao; Hao Lv; Fang Wang; Chao-Qin Feng; Hui Ding; Wei Chen; Hao Lin
Journal: Bioinformatics Date: 2019-06-01 Impact factor: 6.937

3. Computational Prediction of Sigma-54 Promoters in Bacterial Genomes by Integrating Motif Finding and Machine Learning Strategies.

Authors: Bingqiang Liu; Ling Han; Xiangrong Liu; Jichang Wu; Qin Ma
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2018-03-15 Impact factor: 3.710

4. Sequence clustering in bioinformatics: an empirical study.

Authors: Quan Zou; Gang Lin; Xingpeng Jiang; Xiangrong Liu; Xiangxiang Zeng
Journal: Brief Bioinform Date: 2018-09-18 Impact factor: 11.622

5. i6mA-Pred: identifying DNA N6-methyladenine sites in the rice genome.

Authors: Wei Chen; Hao Lv; Fulei Nie; Hao Lin
Journal: Bioinformatics Date: 2019-08-15 Impact factor: 6.937

6. Transcriptome-wide Mapping of Internal N⁷-Methylguanosine Methylome in Mammalian mRNA.

Authors: Li-Sheng Zhang; Chang Liu; Honghui Ma; Qing Dai; Hui-Lung Sun; Guanzheng Luo; Zijie Zhang; Linda Zhang; Lulu Hu; Xueyang Dong; Chuan He
Journal: Mol Cell Date: 2019-04-25 Impact factor: 17.970

7. Discovery of m(7)G-cap in eukaryotic mRNAs. (Review)

Authors: Yasuhiro Furuichi
Journal: Proc Jpn Acad Ser B Phys Biol Sci Date: 2015 Impact factor: 3.493

8. PACES: prediction of N4-acetylcytidine (ac4C) modification sites in mRNA.

Authors: Wanqing Zhao; Yiran Zhou; Qinghua Cui; Yuan Zhou
Journal: Sci Rep Date: 2019-07-31 Impact factor: 4.379

9. Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13.

Authors: Jie Hou; Tianqi Wu; Renzhi Cao; Jianlin Cheng
Journal: Proteins Date: 2019-04-25

10. CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors: Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal: Bioinformatics Date: 2012-10-11 Impact factor: 6.937
