Literature DB >> 30470762

RNAm5Cfinder: A Web-server for Predicting RNA 5-methylcytosine (m5C) Sites Based on Random Forest.

Jianwei Li1,2, Yan Huang1, Xiaoyue Yang1, Yiran Zhou2, Yuan Zhou3.   

Abstract

5-methylcytosine (m5C) is a common nucleobase modification, and recent investigations have indicated its prevalence in cellular RNAs including mRNA, tRNA and rRNA. With the rapid accumulation of m5C sites data, it becomes not only feasible but also important to build an accurate model to predict m5C sites in silico. For this purpose, here, we developed a web-server named RNAm5Cfinder based on RNA sequence features and machine learning method to predict RNA m5C sites in eight tissue/cell types from mouse and human. We confirmed the accuracy and usefulness of RNAm5Cfinder by independent tests, and the results show that the comprehensive and cell-specific predictors could pinpoint the generic or tissue-specific m5C sites with the Area Under Curve (AUC) no less than 0.77 and 0.87, respectively. RNAm5Cfinder web-server is freely available at http://www.rnanut.net/rnam5cfinder .

Entities:  

Year:  2018        PMID: 30470762      PMCID: PMC6251864          DOI: 10.1038/s41598-018-35502-4

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Introduction

RNA modification plays an important role in all three domains of life[1-3]. To date, more than 150 kinds of RNA modifications have been discovered, while 5-methylcytosine (m5C) is one of the most prevalent modification types[4]. Thanks to the novel applications of high-throughput sequencing technique for detecting RNA m5C modification (e.g., bisulfite sequencing and aza-IP), a pilot whole-transcriptome map of m5C sites have become available, where the modification sites mainly appear in the anticodon loop and the variable region of tRNAs and rRNAs, and the coding sequences in mRNAs[5-9]. Similar to other nucleobase modifications in RNA, m5C also influences RNA structural stability and translation efficiency, and further researches revealed that it could promote mRNA export and regulate tissue differentiation[10,11]. But the functions of m5C in RNA are still not fully understood, partly because the experimental identification of m5C sites is still expensive and labor-intensive. For this purpose, here, we developed a web-based tool named RNAm5Cfinder to predict m5C sites, which would help researchers to screen potential m5C sites easily and quickly and provide a new tool to dig functional implication of m5C. RNAm5Cfinder is a platform with an easy-to-use web interface to predict m5C modification sites in RNA sequences. It adopts one-hot encoding for coding RNA sequences and random forest algorithm which is a supervised machine learning method for solving classification problems. In view of the fact that m5C is a tissue-specific modification, we built independent predictor for every tissue/cell type respectively. Finally, we optimized each predictor independently by cross-validation and benchmarked the predictors by independent tests. To our best knowledge, RNAm5Cfinder is the first m5C predictor that allows for predicting tissue-specific m5C sites with competitive precision.

Results and Discussion

Establishment of the predictor and performance benchmarking

The m5C modification data covering 7 tissues of mouse and human Hela cells were collected from previous studies[10,12]. We first integrated all m5C sites to build a comprehensive (generic) predictor. In the training process, we continuously optimized the ratio of the positives and the negatives of the training data set and changed the number of the decision trees in the random forest predictor by five-fold cross-validation. The results suggest that the optimal parameters are 1:30 ratio and 300 decision trees, respectively. In order to verify the performance of the predictor, we benchmarked and compared its performance with other state-of-the-art published web servers for predicting RNA m5C sites on the same independent test set. We found two available online servers for predicting RNA m5C sites which are iRNA-PseColl developed by Feng et al. and M5C-HPCR developed by Zhang et al.[13,14]. Both of them can predict m5C sites in RNA sequences, but they don’t permit tissue-specific prediction. Therefore, we compared the performance of our comprehensive predictor with iRNA-PseColl and M5C-HPCR. Note that the thresholds of the servers above are fixed, resulting in a single point in the ROC (receiver operating characteristic curve) curve that corresponds to their performance (Fig. 1). As for the strategy for coding RNA sequence, RNAm5Cfinder adopted one-hot encoding and by trying to re-train our predictor with Feng’s coding strategy and found that the performance was slightly reduced (Fig. 2), indicating that one-hot encoding is at least comparable to the current state-of-art method for RNA m5C site prediction. Another reason for picking one-hot encoding is that it is timesaving and could give the users a good experience comparing to other strategies.
Figure 1

Performance comparison between RNAm5Cfinder comprehensive predictor and other available servers on independent test.

Figure 2

The comparison between one-hot encoding and Feng’s encoding on independent test.

Performance comparison between RNAm5Cfinder comprehensive predictor and other available servers on independent test. The comparison between one-hot encoding and Feng’s encoding on independent test.

The performance of the tissue-specific predictors

Taking into account the modification spectrum in different cell types or tissues are not the same, one comprehensive predictor can not accurately predict the m5C sites from each specific tissue or cell type. We further applied tissue-specific training and independent test sets where RNA m5C modification data was came from experiments on single tissue or cell to test and benchmark the tissue-specific m5C predictors (Table 1). In order to verify the robustness of the constructed tissue-specific predictors, we performed both intra- and inter-tissue independent tests for each tissue-specific predictor. For each independent test set, we removed the samples which were used to train the predictors for the rest of tissues. In other words, we only considered tissue-specific sites in the independent test for the intra- and inter-tissue independent tests. The results are summarized in Fig. 3. Clearly, the intra-tissue prediction performances, which are all above 0.87 in terms of AUC, are substantially better than inter-tissue prediction performance. This is consistent with previous studies, where m5C is implied as a tissue-specific modification[10]. This result also supports that it is necessary to build tissue-specific m5C predictors.
Table 1

AUC of independent test of different predictors.

Cell typesAUC
mouse_ESC0.902
mouse_Heart0.772
mouse_Kidney0.768
mouse_Liver0.768
mouse_Muscle0.767
mouse_Small-Intestine0.769
mouse_Brain0.775
human_Hela0.765
Figure 3

The results of intra- and inter-tissue independent tests for each tissue specific predictor. (A) The color correlates with the performance (AUC). (B) ESC, embryonic stem cell.

AUC of independent test of different predictors. The results of intra- and inter-tissue independent tests for each tissue specific predictor. (A) The color correlates with the performance (AUC). (B) ESC, embryonic stem cell.

The construction of RNAm5Cfinder web-server

To facilitate the community, we built a web-server named RNAm5Cfinder with the optimized comprehensive and tissue-specific predictors mentioned above. RNAm5Cfinder has a user-friendly interface and step-by-step guide. It takes the FASTA sequences as the input and provide the option to switch between the comprehensive predictor and the tissue-specific predictors. We also provide 3 levels of stringent thresholds, corresponding to the false positive rate values of 1%, 5%, 10%. Considering users may analyze large dataset which will spend plenty of time, RNAm5Cfinder also supports the function to send results to the submitted E-mail address.

Methods

Datasets

We gathered three available m5C datasets in GEO database including GSE90963 (human Hela cell), GSE93749 (human Hela cell; heart, muscle, brain, kidney and liver of mouse) and GSE83432 (mouse ESC and brain). Then m5C sites from the three datasets were first mapped to the Ensembl transcripts (queried at Feb, 2018, the genome version is GRCh37 for human and GRCm38 for mouse)[15]. For multiple transcripts of the same gene, we picked the mRNA transcript which have relatively more modification sites to insure the quality and reliability of data. One quarter of the m5C site data was randomly selected as the independent test set while the rest was used to train the predictors. The negative samples were randomly selected from the non-modified C sites in the transcripts. Since the ratio of positive and negative training samples could affect the precision of the prediction model, we preliminarily tested 3 ratios (1:10, 1:30, 1:50) and finally considered the best one (1:30) based on cross-validation. In order to fit the real-world data, as for the independent test sets, all of the non-modified C sites were used as the negative samples (Table 2).
Table 2

The sample size of different tissues’ training and test datasets.

TissuesTraining setTest set
posnegposneg
Comprehensive19,798593,94166361,924,243
ESCa-specific3440103,201828299,610
Heart-specific12,703381,09110030,433
Kidney-specific12,700381,00112237,088
Liver-specific11,937358,11112537,844
Muscle-specific11,826354,78111836,519
Small-Intestine-specific11,372341,16110732,170
Brain-specific19,141424,231472155,409

The ratio of the positives and the negatives of the training set and test set were set to 1:30 and 1:all respectively. As for the test sets of tissue-specific predictors, samples which were used to train predictors for the other tissues were discarded.

aESC, embryonic stem cell.

The sample size of different tissues’ training and test datasets. The ratio of the positives and the negatives of the training set and test set were set to 1:30 and 1:all respectively. As for the test sets of tissue-specific predictors, samples which were used to train predictors for the other tissues were discarded. aESC, embryonic stem cell.

Sequence encoding

To train the machine learning model, the RNA sequence flanking the modified/non-modified sites should be translated to the numeric feature encoding. In this study, two kinds of encoding strategies were tested and compared, which were the one-hot encoding[16] and Feng’s encoding[14]. The one-hot encoding uses n bits of 0 or 1 to represent n kinds of nucleotide state. For each position, the A, G, C, T are translated into vectors of (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) and (0, 0, 0, 1), respectively. Feng’s encoding also uses four bits to represent specific nucleotide. But unlike one-hot encoding, the first three bits in Feng’s encoding represent three kinds of physicochemical characters (which are the ring number, the chemical functionality and the number of hydrogen bonds). And the fourth bit of Feng’s encoding represents the accumulated occurrence frequency of the nucleotide in the sequence. Therefore, A, G, C, T are translated into vectors of (1, 1, 1, FreqA), (1, 0, 0, FreqG), (0, 1, 0, FreqC) and (0, 0, 1, FreqT), respectively. The size of flanking sequence window to be encoded by the one-hot and Feng’s encodings are both 10, which were optimized by five-fold cross-validation. According to their performance and complexity we finally chose one-hot encoding strategy.

Machine learning algorithm

We have tested four methods of machine learning which are logistic regression, naïve Bayes, Decision Tree (with parameters minsplit = 35, cp = 0.00001 and maxdepth = 30) and Random forest (RF) with integrated RNA m5C sites. The performance of each algorithm is shown in Table 3. Considering both efficiency and accuracy, we finally chose RF as our preferred algorithm. RF algorithm is a robust machine learning framework that has been widely used in medicine and biology information fields[17]. RF consists of a large ensemble of classification and regression trees (CARTs). The number of CARTs is defined as n_tree, which was also optimized by cross-validation. The random forest algorithm was implemented by using the ‘randomForest’ package in R[18].
Table 3

Performance of different machine learning algorithm.

AlgorithmAUC
logistic regression0.700
naïve Bayes0.686
Decision Tree0.726
Random forest0.773
Performance of different machine learning algorithm.

Performance evaluation

In this study we used ROC (receiver operating characteristic) curve, which is less affected by the unbalanced test data set, to evaluate the performance of predictors. ROC curve reflects the overall relationship between sensitivity and specificity when different thresholds are applied. The sensitivity and specificity are defined aswhere TP, TN, FP and FN represent the number of true positive, true negative, false positive and false negative samples, respectively. The larger the area under the curve (AUC), the higher the prediction performance. We benchmarked our predictors on the independent test sets. We also compared the comprehensive predictor of RNAm5Cfinder with iRNA-PseColl and M5C-HPCR on the same independent test set. The binary (yes or no) prediction results of iRNA-PseColl and M5C-HPCR were obtained by submitting the RNA sequences to their servers.

Construction of web-server platform

The user interface and message response mechanisms were based on JavaScript and Ajax. The data processing module was written by PHP5 and could process the input sequences into the numeric sequence encoding for subsequent random forest prediction.

Conclusions

From above analyses, we can draw a conclusion that RNAm5Cfinder is an efficient tool to predict m5C sites. Comparing with other predictors, RNAm5Cfinder has two advantages: (1) Larger and more updated dataset, which together with the random forest machine learning framework, results in a better performance. (2) Ability to predict tissue-specific m5C sites. We believe that RNAm5Cfinder has great potentials and with more m5C site data become available, the performance of RNAm5Cfinder could be further improved.
  17 in total

Review 1.  Emerging roles of RNA modifications in bacteria.

Authors:  Carmelita Nora Marbaniang; Jörg Vogel
Journal:  Curr Opin Microbiol       Date:  2016-01-21       Impact factor: 7.934

2.  SRAMP: prediction of mammalian N6-methyladenosine (m6A) sites based on sequence-derived features.

Authors:  Yuan Zhou; Pan Zeng; Yan-Hui Li; Ziding Zhang; Qinghua Cui
Journal:  Nucleic Acids Res       Date:  2016-02-20       Impact factor: 16.971

Review 3.  Random forests for genomic data analysis.

Authors:  Xi Chen; Hemant Ishwaran
Journal:  Genomics       Date:  2012-04-21       Impact factor: 5.736

4.  RNA cytosine methylation analysis by bisulfite sequencing.

Authors:  Matthias Schaefer; Tim Pollex; Katharina Hanna; Frank Lyko
Journal:  Nucleic Acids Res       Date:  2008-12-05       Impact factor: 16.971

5.  5-methylcytosine promotes mRNA export - NSUN2 as the methyltransferase and ALYREF as an m5C reader.

Authors:  Xin Yang; Ying Yang; Bao-Fa Sun; Yu-Sheng Chen; Jia-Wei Xu; Wei-Yi Lai; Ang Li; Xing Wang; Devi Prasad Bhattarai; Wen Xiao; Hui-Ying Sun; Qin Zhu; Hai-Li Ma; Samir Adhikari; Min Sun; Ya-Juan Hao; Bing Zhang; Chun-Min Huang; Niu Huang; Gui-Bin Jiang; Yong-Liang Zhao; Hai-Lin Wang; Ying-Pu Sun; Yun-Gui Yang
Journal:  Cell Res       Date:  2017-04-18       Impact factor: 25.617

6.  iRNA-PseColl: Identifying the Occurrence Sites of Different RNA Modifications by Incorporating Collective Effects of Nucleotides into PseKNC.

Authors:  Pengmian Feng; Hui Ding; Hui Yang; Wei Chen; Hao Lin; Kuo-Chen Chou
Journal:  Mol Ther Nucleic Acids       Date:  2017-03-29

7.  Identification of direct targets and modified bases of RNA cytosine methyltransferases.

Authors:  Vahid Khoddami; Bradley R Cairns
Journal:  Nat Biotechnol       Date:  2013-04-21       Impact factor: 54.908

8.  NSun2-mediated cytosine-5 methylation of vault noncoding RNA determines its processing into regulatory small RNAs.

Authors:  Shobbir Hussain; Abdulrahim A Sajini; Sandra Blanco; Sabine Dietmann; Patrick Lombard; Yoichiro Sugimoto; Maike Paramor; Joseph G Gleeson; Duncan T Odom; Jernej Ule; Michaela Frye
Journal:  Cell Rep       Date:  2013-07-18       Impact factor: 9.423

9.  MODOMICS: a database of RNA modification pathways. 2017 update.

Authors:  Pietro Boccaletto; Magdalena A Machnicka; Elzbieta Purta; Pawel Piatkowski; Blazej Baginski; Tomasz K Wirecki; Valérie de Crécy-Lagard; Robert Ross; Patrick A Limbach; Annika Kotter; Mark Helm; Janusz M Bujnicki
Journal:  Nucleic Acids Res       Date:  2018-01-04       Impact factor: 16.971

Review 10.  Characterizing 5-methylcytosine in the mammalian epitranscriptome.

Authors:  Shobbir Hussain; Jelena Aleksic; Sandra Blanco; Sabine Dietmann; Michaela Frye
Journal:  Genome Biol       Date:  2013-11-29       Impact factor: 13.583

View more
  11 in total

1.  Deepm5C: A deep-learning-based hybrid framework for identifying human RNA N5-methylcytosine sites using a stacking strategy.

Authors:  Md Mehedi Hasan; Sho Tsukiyama; Jae Youl Cho; Hiroyuki Kurata; Md Ashad Alam; Xiaowen Liu; Balachandran Manavalan; Hong-Wen Deng
Journal:  Mol Ther       Date:  2022-05-06       Impact factor: 12.910

2.  Prediction of m5C Modifications in RNA Sequences by Combining Multiple Sequence Features.

Authors:  Lijun Dou; Xiaoling Li; Hui Ding; Lei Xu; Huaikun Xiang
Journal:  Mol Ther Nucleic Acids       Date:  2020-06-10       Impact factor: 8.886

Review 3.  A Census and Categorization Method of Epitranscriptomic Marks.

Authors:  Julia Mathlin; Loredana Le Pera; Teresa Colombo
Journal:  Int J Mol Sci       Date:  2020-06-30       Impact factor: 5.923

4.  RNAmod: an integrated system for the annotation of mRNA modifications.

Authors:  Qi Liu; Richard I Gregory
Journal:  Nucleic Acids Res       Date:  2019-07-02       Impact factor: 16.971

5.  PACES: prediction of N4-acetylcytidine (ac4C) modification sites in mRNA.

Authors:  Wanqing Zhao; Yiran Zhou; Qinghua Cui; Yuan Zhou
Journal:  Sci Rep       Date:  2019-07-31       Impact factor: 4.379

6.  RPS: a comprehensive database of RNAs involved in liquid-liquid phase separation.

Authors:  Mengni Liu; Huiqin Li; Xiaotong Luo; Jieyi Cai; Tianjian Chen; Yubin Xie; Jian Ren; Zhixiang Zuo
Journal:  Nucleic Acids Res       Date:  2022-01-07       Impact factor: 16.971

7.  m5Cpred-XS: A New Method for Predicting RNA m5C Sites Based on XGBoost and SHAP.

Authors:  Yinbo Liu; Yingying Shen; Hong Wang; Yong Zhang; Xiaolei Zhu
Journal:  Front Genet       Date:  2022-03-30       Impact factor: 4.599

Review 8.  Epigenetics: Roles and therapeutic implications of non-coding RNA modifications in human cancers.

Authors:  Dawei Rong; Guangshun Sun; Fan Wu; Ye Cheng; Guoqiang Sun; Wei Jiang; Xiao Li; Yi Zhong; Liangliang Wu; Chuanyong Zhang; Weiwei Tang; Xuehao Wang
Journal:  Mol Ther Nucleic Acids       Date:  2021-05-01       Impact factor: 8.886

9.  m5C-Related lncRNAs Predict Overall Survival of Patients and Regulate the Tumor Immune Microenvironment in Lung Adenocarcinoma.

Authors:  Junfan Pan; Zhidong Huang; Yiquan Xu
Journal:  Front Cell Dev Biol       Date:  2021-06-29

10.  BiLSTM-5mC: A Bidirectional Long Short-Term Memory-Based Approach for Predicting 5-Methylcytosine Sites in Genome-Wide DNA Promoters.

Authors:  Xin Cheng; Jun Wang; Qianyue Li; Taigang Liu
Journal:  Molecules       Date:  2021-12-07       Impact factor: 4.411

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.