Literature DB >> 28476023

iRNAm5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition.

Wang-Ren Qiu^1,2,3, Shi-Yu Jiang², Zhao-Chun Xu², Xuan Xiao^2,3, Kuo-Chen Chou^3,4,5.

Abstract

Occurring at cytosine (C) of RNA, 5-methylcytosine (m5C) is an important post-transcriptional modification (PTCM). The modification plays significant roles in biological processes by regulating RNA metabolism in both eukaryotes and prokaryotes. It may also, however, cause cancers and other major diseases. Given an uncharacterized RNA sequence that contains many C residues, can we identify which one of them can be of m5C modification, and which one cannot? It is no doubt a crucial problem, particularly with the explosive growth of RNA sequences in the postgenomic age. Unfortunately, so far no user-friendly web-server whatsoever has been developed to address such a problem. To meet the increasingly high demand from most experimental scientists working in the area of drug development, we have developed a new predictor called iRNAm5C-PseDNC by incorporating ten types of physical-chemical properties into pseudo dinucleotide composition via the auto/cross-covariance approach. Rigorous jackknife tests show that its anticipated accuracy is quite high. For most experimental scientists' convenience, a user-friendly web-server for the predictor has been provided at http://www.jci-bioinfo.cn/iRNAm5C-PseDNC along with a step-by-step user guide, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. It has not escaped our notice that the approach presented here can also be used to deal with many other problems in genome analysis.

Entities: Chemical Disease Gene Species

Keywords: RNA 5-methylcytosine sites; auto/cross-covariance; physical-chemical property matrix; pseudo dinucleotide composition; web-server

Mesh：

Substances：

Year: 2017 PMID： 28476023 PMCID： PMC5522291 DOI： 10.18632/oncotarget.17104

Source DB: PubMed Journal: Oncotarget ISSN： 1949-2553

INTRODUCTION

Post-transcriptional modifications (PTCM) of RNA plays a paramount role for the metabolism processes of RNAs, such as for their splicing export, immune tolerance, and transcription [1-3]. So far, more than 100 distinct PTCMs have been identified in tRNAs, rRNAs, Mt-tRNAs, miRNAs, lincRNAs, miscRNAs, protein-coding genes, pseudogenes, etc. [1]. Among these modifications, the methylation of the 5-methylcytosine (m5C) is an epigenetic one [4] formed by the action of RNA methyltransferases [5] (Figure 1). The m5C modification is well investigated in DNA, but the corresponding studies in cellular RNA were mainly confined to tRNA and rRNA [6].

Figure 1

Schematic drawing to show the m5C modification in RNA: an important post-transcriptional modification (PTCM) in RNA [4, 5]

During the modification process, a methyl group is attached to the 5th atom of the 6-atom ring. SAM and SAH are the abbreviations of S-adenosylmethionine and S-adenosylhomocysteine, respectively. The former is the source of the methyl group; while the latter, the byproduct.

Schematic drawing to show the m5C modification in RNA: an important post-transcriptional modification (PTCM) in RNA [4, 5]

During the modification process, a methyl group is attached to the 5th atom of the 6-atom ring. SAM and SAH are the abbreviations of S-adenosylmethionine and S-adenosylhomocysteine, respectively. The former is the source of the methyl group; while the latter, the byproduct. Actually, the m5C modification site in RNA has various biological functions, including the one that can regulate RNA metabolism in both eukaryotes and prokaryotes [7]. Furthermore, it plays a key role in yeast cell fate decision [4]. It is also significant for animal (such as mouse) and human embryonic development [1]. Although many efforts have been made by using biological experiments to determine the m5C sites in RNA (see, e.g. [2, 3]), it is time-consuming and expensive to completely rely on the experimental approaches alone. Facing today's explosive growth of uncharacterized RNA sequences, it is highly demanded to develop computational approach to help getting the information. Very recently, in a pioneering study, Feng et al. [8] proposed an interesting method to identify RNA m5C sites via the powerful PseKNC approach [9-11]. But no web-server has been provided for their method, and hence its practical application value is quite limited. In view of this, the present study was initiated to fill such an empty area.

RESULTS AND DISCUSSION

A predictor called “iRNAm5C-PseDNC” has been established. The success rates achieved by it on the benchmark dataset constructed based on experimental servations (Supplementary Information 1) are where the definitions for the metrics Sn, Sp, Acc, and MCC are given in Eq.13 of the MATERIALS AND DISCUSSION section later. Since it is the first web-server predictor ever developed for identifying the m5C sites in RNA sequences, it is not possible to demonstrate its power by comparing with its counterparts for exactly the same purpose. Nevertheless, we can indirectly show its power via a cohort of the anticipated success rates (Table 1) reported from the five powerful web-server predictors in genome and proteome analyses [12] [13-16]. As we can see from the table, the iRNAm5C-PseDNC is with the highest score for Acc (see column 3), and the same is true for MCC (column 4), indicating the proposed predictor is not only high in overall accuracy but also quite stable.

Table 1

A cohort comparison with some existing web-server predictors for different purposes

Predictor's name	Purpose	Acc^a	MCC^a	Sn^a	Sp^a
iRSpot-PseDNC^b	DNA recombination spot	0.8204	0.6380	0.7306	0.8949
iSNO-PseAAC^c	Cysteine S-nitrosylation site	0.6762	0.3515	0.6701	0.6815
iPro54-PseKNC^d	Sigma-54 promoter	0.8043	0.6101	0.7702	0.8385
iRSpot-TNCPseAAC^e	DNA recombination spot	0.8372	0.6710	0.8714	0.7959
iNitro-Tyr^f	Nitrotyrosine site	0.8452	0.4905	0.8176	0.8598
iRNAm5C-PseDNC^g	RNA 5-methylcytosine site	0.9237	0.7935	0.6989	0.9986

a See Eq.13 for the definition.

b See ref. [12].

c See ref. [13].

d See ref. [14].

e See ref. [15].

f See ref. [16].

g The web-server predictor developed in this paper.

a See Eq.13 for the definition. b See ref. [12]. c See ref. [13]. d See ref. [14]. e See ref. [15]. f See ref. [16]. g The web-server predictor developed in this paper. Also, it is instructive to point out that, among the four metrics in Eq.13, the most important are the Acc and MCC. The metrics Sn and Sp are used to measure a predictor from two completely opposite angles, and they are actually constrained with each other [17]. Therefore, it is meaningless to use only one of the two for comparison [18]. When, and only when, both Sn and Sp of the predictor A are higher than those of the predictor B, can we say A is better than B. In other words, a meaningful comparison in this regard should count the scores of both Sn and Sp, or even better, the rate of their combination that is none but the score of MCC [19, 20]. Now, let us use graphic analysis to further show the proposed predictor's quality. Graphs are a useful vehicle for studying complicated biological systems because they can provide intuitive insights, as demonstrated by a series of previous studies (see, e.g., [21-28]). To provide an intuitive illustration, the graph of Receiver Operating Characteristic (ROC) [29, 30] was adopted as given in Figure 2, where the green line is the ROC for iRNAm5C-PseDNC. The area under the ROC curve is called the AUC (area under the curve). Being within the region of 0 and 1, the greater the AUC is, the better the predictor would be. For the current predictor, the AUC is 0.9626, which is very close to 1, the value for a perfect predictor.

Figure 2

A graphical illustration to show the proposed predictor's performances via the ROC (receiver operating characteristic) curves [29, 30]

The area under the ROC curve is called AUC (area under the curve). The greater the AUC value is, the better the performance will be. See the text for further explanation.

A graphical illustration to show the proposed predictor's performances via the ROC (receiver operating characteristic) curves [29, 30]

The area under the ROC curve is called AUC (area under the curve). The greater the AUC value is, the better the performance will be. See the text for further explanation. As shown in a series of recent publications (see, e.g., [18, 19, 31–43]), papers with a user-friendly and publicly accessible web-server will significantly enhance their impacts; this is particularly true for those papers that were aimed at developing various prediction methods [44, 45]. In view of this, the web-server for iRNAm5C-PseDNC has been established at http://www.jci-bioinfo.cn/iRNAm5C-PseDNC. Furthermore, to maximize users’ convenience, a step-to-step guide of how to use it is given in Supplementary Information 2.

MATERIALS AND METHODS

As practiced in a series of recent studies [19, 20, 35–37, 39–41, 46–53] in complying with the 5-step rule proposed in [54], to establish a really useful sequence-based predictor for a biological system, one should make the following five steps very clear: (1) how to construct or select a valid benchmark dataset to train and test the predictor; (2) how to formulate the biological sequence samples with an effective mathematical expression that can truly reflect their essential correlation with the target concerned; (3) how to introduce or develop a powerful algorithm (or engine) to run the prediction; (4) how to properly conduct cross-validation tests to objectively evaluate the anticipated accuracy; (5) how to provide a web-server and user guide to make people very easily to get their desired results. In the rest of this paper, we are to address these point-by-point.

Benchmark dataset

To make the description simpler and cleaner, the Chou's sequential scheme, which had been used by many previous investigators for analyzing the enzyme specificity [55], signal peptide cleavage sites [56], nitrotyrosine sites [16, 57], hydroxyproline or hydroxylysine sites [50, 58], methylation sites [34, 39, 59], protein-protein interaction [32], protein-protein binding sites [60, 61], carbonylation sites [48], and phosphorylation sites [51], was adopted in this study. According to Chou's scheme, a potential RNA m5C modificationsite sample can be generally expressed by where the center symbol denotes the single nucleic acid code cytosine (C), the subscript is an integer, represents the -th upstream nucleotide from the center, the denotes the -th downstream nucleotide, and so forth. The -tuple RNA sample can be further classified into the following two categories: where represents a true sample with C at its center, a false one with C at its center, and the symbol means “a member of” in the set theory. In literature the benchmark dataset usually consists of a training dataset and a testing dataset: the former is for training a model, while the latter for testing it. But as elucidated in a comprehensive review [62], there is no need to artificially separate a benchmark dataset into the two parts if the prediction model is tested by the jackknife or subsampling (K-fold) cross-validation because the outcome thus obtained is actually from a combination of many different independent dataset tests. Thus, the benchmark dataset for the current study can be formulated as where the positive and negative subsets, and , only contain the true and false m5C samples, and , respectively (see Eq.3); while denotes the symbol of “union” in the set theory [62]. The benchmark dataset used in this study was derived from RMBase [1], which is a resource for decoding the landscape of RNA modifications from high-throughput sequencing data. The detailed procedures are as follows. (1) The genomic sequences downloaded from RMBase [63] are in the form of DNA; to make the entire description of this paper in a coherent manner, we first change the code T to U for all the genomic sequences taken from RMBase and make them become RNA sequences. (2) As done in [64], by sliding the -tuple nucleotide window along each of the RNA sequences thus obtained, collected were only those RNA segments with at the center. (3) If the upstream or downstream in a RNA sequence was less than ξ or greater than where L is the length of the RNA sequence concerned, the lacking code was filled with the same code of its nearest neighbor. (4) The RNA segment samples thus obtained were put into the positive subset if their centers were experimentally annotated as the sites; otherwise, into the negative subset . (5) To reduce redundancy and bias, none of included RNA segments had pairwise sequence identity with any other in a same subset. By strictly following the above procedures, we obtained an array of benchmark datasets with different values, and hence different lengths of RNA samples as well (see Eq.2), as illustrated below But it was observed via preliminary tests that when (i.e., the RNA samples formed by 41 nucleotides), the corresponding results were most promising. In other words, we observed a turning point for the success rates at . After this point, the success rates would become going down with the increase of such parameter. Accordingly, hereafter we only consider the 41-tuple nucleotide samples without explicitly mentioning the parameter any more. The benchmark dataset thus obtained is given in Supplementary Information 1, which can also be downloaded at http://www.jci-bioinfo.cn/iRNAm5C/Supp-S1.pdf. It contains 1,900 RNA samples, of which 475 belong to the positive subset and 1,425 to the negative subset .

Sample formulation

An RNA samples in the aforementioned benchmark dataset can be generally expressed as where represents the 1st nucleotide of the RNA sample at its sequence position 1, the 2nd nucleotide at its position 2, and so forth. Except for , they can be any of the four nucleotides; i.e., Based on the sequential model of Eq.6, one could directly utilize BLAST to perform statistical analysis. Unfortunately, this kind of straightforward and intuitive approach failed to work when a query RNA sample did not have significant similarity to any of the character-known RNA sequences. To overcome this problem, investigators have shifted their focus to the discrete or vector model. The reason of doing so is also due to the fact that nearly all the existing machine-learning aorithms can be directly used to handle vector models but not sequences, as elaborated in [45]. One of the well-known vector models for DNA/RNA sequences is the k-tuple nucleotide (or k-mers) composition; i.e., where represents the normalized occurrence frequency of the i-th k-mer, and the symbol T is the transpose operator. When , Eq.8 reduces to where , , and are the normalized occurrence frequencies of adenine, cytosine, thymine, and uracil in the RNA sequence, respectively. When , Eq.8 reduces to where is the normalized occurrence frequencies of AA in the RNA sequence, is that AC, is that AG, is that AU, and so forth. As we can see from above, the vector's dimension will rapidly increase with the k value, causing the so-called “high-dimension disaster” [65] or overfitting problem. This will significantly reduce the deviation tolerance or cluster-tolerant capacity [66], and make the prediction model contain a lot of noise and very unstable. Therefore, the k-mers approach is useful only when the value of k is very small. In other words, it can only be used to incorporate the local or short-range or local sequence-order information, but certainly not the long-range or global sequence-order information. To approximately cover the long-range sequence-order effects, one popular and well-known method is to use the pseudo components that were originally introduced in dealing with protein/peptide sequences [67-73] and recently extended to deal with DNA/RNA sequences [9–11, 34, 74–80]. According to the concept of pseudo components, the RNA sequence can be generally formulated by [11, 54] where the subscript is integer and its value as well as the components will depend on how to extract the desired information from the RNA sequence of Eq.6. In this study, we used the approach called “physical-chemical property matrix combined with auto/cross-covariance” proposed by Liu et al. [39] to define the components in Eq.7. According to that approach, the vector components in Eq.7 are given by where is an integer within the range from 0 to 39. Using exactly the same calculation approach as elaborated in [39], we found that was optimal choice for the current study. As for how to calculate the concrete values in Eq.12, see ref. [39] where a crystal clear description had been given and hence there is no need to repeat here.

Random forest algorithm

Being a powerful algorithm, the random forest (RF) has been increasingly used to analyze various different problems in computational biology (see, e.g. [32, 36, 37, 40, 48, 50–52, 60, 61, 81–84]). The essence of RF is to compare each individual classifier as a tree, and the combination of many such classifiers as a forest. In this study, 100 trees were used for the forest, and dimension of the random subspace was 22. Each tree in the forest is trained with different part of the benchmark dataset, and hence may yield a different result. The final outcome is determined via a vote from all the trees. For more information about RF, see [85] where a very detailed description has been given, and hence there is no need to repeat here. The final predictor obtained via the aforementioned procedures is called as iRNAm5C-PseDNC, where “i” stands for “identify”, and “RNAm5C” for “RNA 5-methylcytosine modification sites”, and “PseDNC” for “pseudo dinucleotide composition”.

Test procedure

One of the important procedures [54] in developing a new prediction method is how to objectively evaluate its anticipated success rate [54]. To address this, we need to consider two issues. (1) What metrics should be used to quantitatively reflect the predictor's quality? (2) What kind of test approach should be utilized to score the metrics?

Metrics formulation

The following metrics are generally used to measure the prediction quality from four different angles: (1) Acc for measuring the overall accuracy of a predictor, (2) MCC for its stability, (3) Sn for its sensitivity, and (4) Sp for its specificity [86]. Unfortunately, their conventional formulations as given in [86] lack intuitiveness and most experimental scientists feel difficult to understand them, particularly for the MCC. Interestingly, using the Chou's symbols introduced in studying signal peptides [56], Xu et al. [13] and Chen et al. [12] converted them into a set of four intuitive equations, as given by where represents the total number of the true m5C sites investigated, while is the number of the true m5C sites incorrectly predicted to be of false m5C site; is the total number of the false m5C sites investigated, while is the number of the false m5C sites incorrectly predicted to be of true m5C site. According to Eq.13, it is crystal clear to see the following. When meaning none of the true m5C sites are incorrectly predicted to be of false m5C site, we have the sensitivity . When meaning that all the true m5C sites are incorrectly predicted to be of false m5C site, we have the sensitivity . Likewise, when meaning none of the false m5C sites are incorrectly predicted to be of m5C site, we have the specificity ; whereas meaning that all the false m5C sites are incorrectly predicted to be of true m5C sites, we have the specificity . When meaning that none of true m5C sites in the positive dataset and none of the false m5C sites in the negative dataset are incorrectly predicted, we have the overall accuracy and ; when and meaning that all the true m5C sites in the positive dataset and all the false m5C sites in the negative dataset are incorrectly predicted, we have the overall accuracy and ; whereas when and we have and meaning no better than random guess. Therefore, Eq.13 has made the meanings of sensitivity, specificity, overall accuracy, and stability much more intuitive and easier-to-understand, particularly for the meaning of MCC, as concurred recently by many investigators (see, e.g., [18, 20, 31–35, 47–52, 60, 75–77, 84, 87–92]). Note that, however, the set of equations defined in Eq.13 is valid only for the single-label systems. For the multi-label systems whose emergence has become more frequent in system biology [93-95] and system medicine [96] or biomedicine [40], a completely different set of metrics are needed as elaborated in [97].

Test method

Now let us discuss what kind of test method should be used to score the four metrics in Eq.13. In statistical analysis, the following three methods are often used to test a predictor: (1) independent dataset test, (2) subsampling (or K-fold cross-validation) test, and (3) jackknife test [98]. Of these three, however, the jackknife test is deemed the least arbitrary that can always yield a unique outcome for a given benchmark dataset as elucidated in [54]. Accordingly, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors (see, e.g., [15, 99–108]). Accordingly, here we also used the jackknife test to check the quality of iRNAm5C-PseDNC predictor. During the jackknifing process, both the training dataset and testing dataset are actually open, and each sample will be in turn moved between the two. The jackknife test can exclude the “memory” effect. Also, the arbitrariness problem [54] rooted in the independent dataset and subsampling tests can be completely avoided because the outcome obtained by the jackknife cross-validation is always unique for a given benchmark dataset.

101 in total

1. Subcellular location prediction of apoptosis proteins.

Authors: Guo-Ping Zhou; Kutbuddin Doctor
Journal: Proteins Date: 2003-01-01

2. pRNAm-PC: Predicting N(6)-methyladenosine sites in RNA sequences via physical-chemical properties.

Authors: Zi Liu; Xuan Xiao; Dong-Jun Yu; Jianhua Jia; Wang-Ren Qiu; Kuo-Chen Chou
Journal: Anal Biochem Date: 2015-12-31 Impact factor: 3.365

3. Predicting protein submitochondrial locations by incorporating the positional-specific physicochemical properties into Chou's general pseudo-amino acid compositions.

Authors: Ya-Sen Jiao; Pu-Feng Du
Journal: J Theor Biol Date: 2017-01-08 Impact factor: 2.691

4. A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins.

Authors: K C Chou
Journal: J Biol Chem Date: 1993-08-15 Impact factor: 5.157

5. Identification of protein-protein binding sites by incorporating the physicochemical properties and stationary wavelet transforms into pseudo amino acid composition.

Authors: Jianhua Jia; Zi Liu; Xuan Xiao; Bingxiang Liu; Kuo-Chen Chou
Journal: J Biomol Struct Dyn Date: 2015-10-29

6. Prediction of β-lactamase and its class by Chou's pseudo-amino acid composition and support vector machine.

Authors: Ravindra Kumar; Abhishikha Srivastava; Bandana Kumari; Manish Kumar
Journal: J Theor Biol Date: 2014-10-22 Impact factor: 2.691

7. Identification of real microRNA precursors with a pseudo structure status composition approach.

Authors: Bin Liu; Longyun Fang; Fule Liu; Xiaolong Wang; Junjie Chen; Kuo-Chen Chou
Journal: PLoS One Date: 2015-03-30 Impact factor: 3.240

8. iPhos-PseEn: identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier.

Authors: Wang-Ren Qiu; Xuan Xiao; Zhao-Chun Xu; Kuo-Chen Chou
Journal: Oncotarget Date: 2016-08-09

9. iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences.

Authors: Wei Chen; Pengmian Feng; Hui Yang; Hui Ding; Hao Lin; Kuo-Chen Chou
Journal: Oncotarget Date: 2017-01-17

10. Some remarks on protein attribute prediction and pseudo amino acid composition.

Authors: Kuo-Chen Chou
Journal: J Theor Biol Date: 2010-12-17 Impact factor: 2.691

35 in total

1. Predicting membrane proteins and their types by extracting various sequence features into Chou's general PseAAC.

Authors: Ahmad Hassan Butt; Nouman Rasool; Yaser Daanial Khan
Journal: Mol Biol Rep Date: 2018-09-20 Impact factor: 2.316

2. Structural insights of dipeptidyl peptidase-IV inhibitors through molecular dynamics-guided receptor-dependent 4D-QSAR studies.

Authors: Rajesh B Patil; Euzebio G Barbosa; Jaiprakash N Sangshetti; Vishal P Zambre; Sanjay D Sawant
Journal: Mol Divers Date: 2018-03-13 Impact factor: 2.943

3. NSUN2-mediated RNA m⁵C modification modulates uveal melanoma cell proliferation and migration.

Authors: Guangying Luo; Weiwei Xu; Xiaoyan Chen; Siqi Wang; Jiao Wang; Feng Dong; Dan-Ning Hu; Peter S Reinach; Dongsheng Yan
Journal: Epigenetics Date: 2022-06-26 Impact factor: 4.861

4. Geographic encoding of transcripts enabled high-accuracy and isoform-aware deep learning of RNA methylation.

Authors: Daiyun Huang; Kunqi Chen; Bowen Song; Zhen Wei; Jionglong Su; Frans Coenen; João Pedro de Magalhães; Daniel J Rigden; Jia Meng
Journal: Nucleic Acids Res Date: 2022-10-14 Impact factor: 19.160

5. Deepm5C: A deep-learning-based hybrid framework for identifying human RNA N5-methylcytosine sites using a stacking strategy.

Authors: Md Mehedi Hasan; Sho Tsukiyama; Jae Youl Cho; Hiroyuki Kurata; Md Ashad Alam; Xiaowen Liu; Balachandran Manavalan; Hong-Wen Deng
Journal: Mol Ther Date: 2022-05-06 Impact factor: 12.910

6. PScL-HDeep: image-based prediction of protein subcellular location in human tissue using ensemble learning of handcrafted and deep learned features with two-layer feature selection.

Authors: Matee Ullah; Ke Han; Fazal Hadi; Jian Xu; Jiangning Song; Dong-Jun Yu
Journal: Brief Bioinform Date: 2021-11-05 Impact factor: 13.994

7. UltraPse: A Universal and Extensible Software Platform for Representing Biological Sequences.

Authors: Pu-Feng Du; Wei Zhao; Yang-Yang Miao; Le-Yi Wei; Likun Wang
Journal: Int J Mol Sci Date: 2017-11-14 Impact factor: 5.923

8. Assessing the Performances of Protein Function Prediction Algorithms from the Perspectives of Identification Accuracy and False Discovery Rate.

Authors: Chun Yan Yu; Xiao Xu Li; Hong Yang; Ying Hong Li; Wei Wei Xue; Yu Zong Chen; Lin Tao; Feng Zhu
Journal: Int J Mol Sci Date: 2018-01-08 Impact factor: 5.923

Review 9. Epigenetics: Roles and therapeutic implications of non-coding RNA modifications in human cancers.

Authors: Dawei Rong; Guangshun Sun; Fan Wu; Ye Cheng; Guoqiang Sun; Wei Jiang; Xiao Li; Yi Zhong; Liangliang Wu; Chuanyong Zhang; Weiwei Tang; Xuehao Wang
Journal: Mol Ther Nucleic Acids Date: 2021-05-01 Impact factor: 8.886

10. m5C-Related lncRNAs Predict Overall Survival of Patients and Regulate the Tumor Immune Microenvironment in Lung Adenocarcinoma.

Authors: Junfan Pan; Zhidong Huang; Yiquan Xu
Journal: Front Cell Dev Biol Date: 2021-06-29