
iRNA-3typeA: Identifying Three Types of Modification at RNA's Adenosine Sites.

Wei Chen¹, Pengmian Feng², Hui Yang³, Hui Ding³, Hao Lin⁴, Kuo-Chen Chou⁵.

Abstract

RNA modifications are additions of chemical groups to nucleotides or local structural changes in them. Knowledge of where these modifications occur is essential for an in-depth understanding of biological functions and mechanisms, and for treating some genomic diseases as well. With the avalanche of RNA sequences generated in the post-genomic age, many computational methods have been proposed for identifying various types of RNA modifications one by one. However, so far no method has been developed for simultaneously identifying several different types of RNA modifications. To address this challenge, we developed a predictor called "iRNA-3typeA," by which we can simultaneously identify the occurrence sites of the following three most frequently observed modifications in RNA: (1) N1-methyladenosine (m1A), (2) N6-methyladenosine (m6A), and (3) adenosine to inosine (A-to-I). Rigorous cross-validations on RNA sequences from the Homo sapiens and Mus musculus transcriptomes show that the success rates achieved by the new predictor are quite high. For the convenience of experimental scientists, a user-friendly web server for iRNA-3typeA has been established at http://lin-group.cn/server/iRNA-3typeA/. It is anticipated that iRNA-3typeA may become a useful high-throughput tool for genome analysis.


Keywords: N(1)-methyladenosine; N(6)-methyladenosine; RNA modification; adenosine to inosine editing; five-step rules; web server

Year: 2018 PMID: 29858081 PMCID: PMC5992483 DOI: 10.1016/j.omtn.2018.03.012

Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886

Introduction

RNA modification means the addition of chemical groups to an RNA's constituent nucleotides or structural changes therein. So far, more than 100 types of RNA modifications have been observed in the cellular RNAs of all living organisms. Because they are involved in a series of crucial biological activities, such as mRNA splicing, mRNA nuclear processing, mRNA export, and mRNA decay,3, 4, 5, 6 and are particularly linked with human diseases, RNA modifications have drawn great attention in the scientific community. With the development of high-throughput experimental techniques,7, 8, 9 a large amount of RNA modification data has been acquired, which is very helpful for revealing novel functions of RNA modifications. As indicated in a recent review, however, most of these methods are unable to discriminate among the different RNA modifications that may simultaneously occur in the same RNA molecule. For example, adenosine commonly undergoes N1-methyladenosine (m1A), N6-methyladenosine (m6A), and adenosine to inosine (A-to-I) modifications (Figure 1). Unfortunately, using the aforementioned techniques, one cannot detect whether different types of RNA modifications take place at the same time, let alone analyze their combined biological functions.

Figure 1

The Three Common Types of Modifications in RNA

(1) N1-methyladenosine (m1A), (2) N6-methyladenosine (m6A), and (3) adenosine to inosine (A-to-I).

Therefore, computational methods that address this problem are urgently needed. As excellent complements to experimental techniques, computational methods have been developed to identify RNA modifications12, 13, 14, 15, 16, 17, 18 by using machine learning to train models on the large datasets yielded by high-throughput experiments. However, they are rarely able to identify multiple RNA modifications simultaneously. The present study was devoted to developing a bioinformatics tool that can identify the m1A, m6A, and A-to-I modifications that may simultaneously occur on adenosine in both Homo sapiens and Mus musculus transcriptomes. As shown in a series of recent publications,19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 complying with the five-step rules in developing a bioinformatics tool yields the following advantages: (1) clearer logic deduction, (2) better illumination for stimulating other relevant tools, and (3) more usefulness in practical application. In view of this, we elaborate below the procedures required by the five-step rules: (1) benchmark dataset, (2) sample formulation, (3) operative machine, (4) cross-validation, and (5) web server. They are embedded into the rubrics according to the journal's format.

Results and Discussion

Performance Report

Listed in Table 1 are the jackknife test results obtained by the proposed predictor on the benchmark datasets (Supplemental Information S1 and Supplemental Information S2 available at http://lin-group.cn/server/iRNA3typeA/data.htm) for H. sapiens and M. musculus, respectively. As we can see from the table, both the overall accuracy (Acc, 88.39% to 99.13%) and the stability (MCC, 0.80 to 0.98) are quite high for all three types of modifications investigated, indicating that the predictor is not only high in overall success rate but also quite stable. Therefore, iRNA-3typeA has considerable potential to become a high-throughput tool in both basic research and drug development.

Table 1

The Success Rates Achieved by iRNA-3typeA via Jackknife Tests on the Benchmark Datasets for H. sapiens and M. musculus, Respectively

Species	Type of Modification	Sn (%)	Sp (%)	Acc (%)	MCC
H. sapiens	m¹A (a)	98.38	99.89	99.13	0.98
H. sapiens	m⁶A (b)	81.68	99.11	90.38	0.82
H. sapiens	A→I (c)	86.18	95.23	90.71	0.82
M. musculus	m¹A (d)	97.46	100.00	98.73	0.97
M. musculus	m⁶A (e)	77.79	100.00	88.39	0.80
M. musculus	A→I (f)	96.75	100.00	98.38	0.96

SVM parameters used for each model (the values of C are missing from the source text):
(a) γ = 0.0078125
(b) γ = 3.05158e-5
(c) γ = 0.0078125
(d) γ = 0.0078125
(e) γ = 0.00012207
(f) γ = 0.000488281

It is instructive to point out that, although the current predictor is limited to identifying m1A, m6A, and A-to-I sites in RNA sequences from H. sapiens and M. musculus, it can easily be extended to cover more types of modifications and more species as more experimental data become available. Therefore, the current predictor is just a good start; it will be updated with the aim of continuously enhancing its power and coverage.

Comparison with Other Classifiers

The proposed predictor iRNA-3typeA is the first predictor constructed for identifying the three types of RNA modifications (m1A, m6A, and A-to-I) simultaneously. A conventional comparison is therefore not possible, since no other predictor can do the same. Nevertheless, a special comparison can be carried out to further examine its performance. As described in Materials and Methods, the operative machine used for iRNA-3typeA is a support vector machine (SVM) classifier. What would happen if other classifiers were used instead? Listed in Table 2 are the results obtained when the SVM classifier was substituted with each of the other classifiers.

Table 2

The Comparative Results of the Proposed Predictor When Its Operating Algorithm Was Replaced from SVM to Other Classifiers

Classifier	Species	Modification Type	Sn (%)	Sp (%)	Acc (%)	MCC
BayesNet (a)	H. sapiens	m¹A	98.81	98.85	98.83	0.98
BayesNet (a)	H. sapiens	m⁶A	82.04	100.00	91.02	0.83
BayesNet (a)	H. sapiens	A→I	88.50	89.57	89.03	0.78
BayesNet (a)	M. musculus	m¹A	97.18	98.78	97.98	0.96
BayesNet (a)	M. musculus	m⁶A	77.79	100.00	88.90	0.80
BayesNet (a)	M. musculus	A→I	96.51	99.88	98.20	0.96
Naive Bayes (a)	H. sapiens	m¹A	98.16	98.30	98.23	0.96
Naive Bayes (a)	H. sapiens	m⁶A	82.04	99.73	90.88	0.83
Naive Bayes (a)	H. sapiens	A→I	89.40	87.04	88.22	0.76
Naive Bayes (a)	M. musculus	m¹A	96.43	97.75	97.09	0.94
Naive Bayes (a)	M. musculus	m⁶A	77.79	98.62	88.22	0.78
Naive Bayes (a)	M. musculus	A→I	95.91	97.95	96.93	0.94
J48 Tree (a)	H. sapiens	m¹A	98.77	99.40	99.09	0.98
J48 Tree (a)	H. sapiens	m⁶A	82.48	84.35	83.41	0.67
J48 Tree (a)	H. sapiens	A→I	88.18	89.04	88.60	0.77
J48 Tree (a)	M. musculus	m¹A	96.71	98.68	97.70	0.95
J48 Tree (a)	M. musculus	m⁶A	83.03	82.21	82.62	0.65
J48 Tree (a)	M. musculus	A→I	96.27	99.04	97.65	0.95
SVM (b)	H. sapiens	m¹A	98.46	99.89	99.18	0.98
SVM (b)	H. sapiens	m⁶A	80.44	100.00	90.23	0.82
SVM (b)	H. sapiens	A→I	86.73	95.40	91.07	0.82
SVM (b)	M. musculus	m¹A	97.46	100.00	98.73	0.97
SVM (b)	M. musculus	m⁶A	77.79	100.00	88.90	0.80
SVM (b)	M. musculus	A→I	97.35	100.00	98.67	0.97

Note: All the rates are obtained by 10-fold cross-validation on the same benchmark datasets (Supplemental Information S1 and Supplemental Information S2 available at http://lin-group.cn/server/iRNA3typeA/data.htm).
(a) Taken from the WEKA package.
(b) Proposed in this paper.

From the table, we can see the following: (1) The SVM classifier achieves a higher Acc than J48 Tree for every modification type in both species, and its MCC is never lower, although J48 Tree attains a higher Sn in several cases (e.g., m6A in M. musculus). (2) Although the Acc of the SVM classifier is slightly lower than that of the BayesNet and Naive Bayes classifiers in identifying the m6A sites for H. sapiens, its accuracies in identifying all the other types of modifications for both H. sapiens and M. musculus are equal to or higher than those of BayesNet and Naive Bayes. These results further indicate that the SVM classifier is a sound choice for the iRNA-3typeA predictor.

Web Server and User Guide

The last step of the five-step rules is about the web server. It is important because user-friendly and publicly accessible web servers represent the future direction for developing practically more useful predictors. It has been demonstrated by a series of recent publications (see, e.g., Cheng et al.,25, 34, 35, 36 Liu et al., Lin et al., Jia et al.,38, 39 and Cheng and Xiao) that making a web server available for a new prediction method significantly enhances its impact.41, 42 In view of this, the web server for iRNA-3typeA has been established. Furthermore, to maximize the convenience for experimental scientists, a step-by-step guide is given below.

Step 1. Open the iRNA-3typeA web server at http://lin-group.cn/server/iRNA-3typeA; you will see the top page of the web server as shown in Figure 2A.

Figure 2

The Semi-screenshot for the Top Page of the iRNA-3typeA Web Server and the Prediction Result of the Two Example Query Sequences

The Semi-screenshot for the top page of the iRNA-3typeA Web Server (top panel) and the Prediction Result of the two example query sequences (bottom panel).

Step 2. Either type or copy/paste the query RNA sequences (in FASTA format) into the input box. Example sequences can be found by clicking on the Example button.

Step 3. Click the open circle next to H. sapiens or M. musculus to choose the species concerned, and then click the Submit button. For example, if you use the query RNA sequences in the Example window as the input and choose H. sapiens, after submission you will see the predicted results summarized in a table (Figure 2B), indicating that (1) the adenosine at position 21 of sequence #1 has the potential to be a site of m1A or A-to-I editing modification, and (2) the adenosine at position 21 of sequence #2 has the potential to be a site of m6A modification only. All these predicted results are fully consistent with experimental observations.
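The input described in Step 2 can be illustrated with a minimal sketch of FASTA parsing. The record names and sequences below are hypothetical (they are not the server's example data), and the server's own parsing and validation rules are not described in the text, so the RNA-alphabet check is only an assumption about what a sensible pre-check might look like.

```python
from typing import Dict

RNA_ALPHABET = set("ACGU")


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each '>' header (without the '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


# Hypothetical 41-nt queries with an adenosine at position 21 (index 20).
# The second record is split over two lines to show that lines are joined.
example = "\n".join([
    ">hypothetical_seq_1",
    "GGCUU" * 4 + "A" + "CCGAU" * 4,
    ">hypothetical_seq_2",
    "UUGCA" * 4,
    "A" + "GGACU" * 4,
])

for name, seq in parse_fasta(example).items():
    is_rna = set(seq) <= RNA_ALPHABET  # assumed pre-check, not the server's rule
    print(name, len(seq), seq[20], is_rna)
```

Both hypothetical records come out as 41 nt long with `A` at position 21, which matches the position reported in the example prediction above.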

Materials and Methods

Benchmark Datasets

The benchmark datasets for m1A, m6A, and A-to-I editing sites in H. sapiens and M. musculus genomes were derived from previous works.12, 14, 43 Listed in Table 3 are the numbers of positive and negative samples for each of the benchmark datasets. It has been found by similar approaches12, 14 that the optimal length of the sequence samples in the benchmark datasets is 41 nt, with the modified site (m1A, m6A, or A-to-I editing site) at the center. For readers' convenience, the benchmark dataset thus obtained for H. sapiens is given in Supplemental Information S1, while that for M. musculus is given in Supplemental Information S2; both can be downloaded from the link at http://lin-group.cn/server/iRNA3typeA/data.htm.
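The 41-nt sample layout described above (a candidate adenosine at the center with 20 nt on each side) can be sketched as follows. This is only an illustration: the function name is hypothetical, and skipping adenosines that lack full flanks is an assumption, since the text does not say how sites near the ends of a transcript are handled.

```python
from typing import Iterator, Tuple

FLANK = 20  # 20 nt upstream + central adenosine + 20 nt downstream = 41 nt


def adenosine_windows(rna: str, flank: int = FLANK) -> Iterator[Tuple[int, str]]:
    """Yield (1-based position, window) for each A that has full flanks."""
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    for i, nt in enumerate(rna):
        if nt == "A" and i >= flank and i + flank < len(rna):
            yield i + 1, rna[i - flank:i + flank + 1]


# Toy transcript: the first A has full flanks, the last A sits at the 3' end.
transcript = "C" * 20 + "A" + "G" * 20 + "A"
for position, window in adenosine_windows(transcript):
    print(position, len(window), window[FLANK])  # 21 41 A
```

Only the adenosine at position 21 yields a window here; the terminal adenosine is skipped because it has no downstream flank.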

Table 3

A Breakdown of the Benchmark Dataset

Species	Attribute	Number of m¹A Samples	Number of m⁶A Samples	Number of A→I Samples
H. sapiens	positive	6,366	1,130	3,000
H. sapiens	negative	6,366	1,130	3,000
M. musculus	positive	1,064	725	831
M. musculus	negative	1,064	725	831

Sample Formulation

An RNA sample with 41 nt is usually sequentially formulated by

$$\mathbf{R} = N_1 N_2 N_3 \cdots N_i \cdots N_{41} \tag{1}$$

where

$$N_i \in \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{U}\} \tag{2}$$

denotes the nucleotide at the i-th sequence position, and ∈ is a symbol in set theory meaning “member of.” To enable the existing machine-learning algorithms to handle the RNA sample, the first thing we need to do is to convert its sequential formulation into a vector. But a vector in a discrete framework might totally miss all the sequence-order information or pattern feature. To deal with this problem, the PseAAC (pseudo amino acid composition) was introduced. Ever since the concept of PseAAC was proposed, it has swiftly penetrated into many biomedicine and drug development areas45, 46 and nearly all the areas of computational proteomics (see, e.g., Esmaeili et al., Mohabatkar et al., Nanni et al., Pacharawongsakda and Theeramunkong, Mondal and Pai, Ahmad et al., Kabir and Hayat, Yu et al., Zhang and Duan, Muthu Krishnan, and a long list of references cited in two review papers42, 57). Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, this idea has been extended to deal with DNA/RNA sequences21, 28, 37, 58, 59, 60 in computational genomics via PseKNC (pseudo K-tuple nucleotide composition).61, 62 According to Chen et al., the general form of PseKNC can be formulated as

$$\mathbf{R} = [\phi_1 \ \phi_2 \ \cdots \ \phi_u \ \cdots \ \phi_Z]^{\mathrm{T}} \tag{3}$$

where T is the transposing operator, the subscript Z is an integer, and its value and the components $\phi_u$ (u = 1, 2, …, Z) will depend on how to extract the desired features and properties from the RNA sequence (cf. Equation 1). In this study, their definitions are described below.

The four bases (A, C, G, and U) of RNA have different chemical properties and structures.64, 65 Accordingly, A, C, G, and U can be represented by (1, 1, 1), (0, 0, 1), (1, 0, 0), and (0, 1, 0), respectively.20, 27 For instance, the RNA sequence with six nucleotides “GUGCAG” can be expressed by a vector of 6 × 3 = 18 components; i.e., [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]. Moreover, to incorporate into Equation 3 the sequence-coupled information for the nucleotides around the modification sites, we adopt the lingering density as defined below

$$D_i = \frac{1}{L_i} \sum_{j=1}^{L_i} f(N_j) \tag{4}$$

where $D_i$ is the density of the nucleotide $N_i$ at site i of an RNA sequence, $L_i$ the length of the sliding substring concerned (judging from the example below, the first i nucleotides of the sequence); j denotes each of the site locations counted in the substring, and

$$f(N_j) = \begin{cases} 1, & \text{if } N_j = N_i \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

For example, the RNA sequence “GUGCAG” can be represented by the vector [1, 0.5, 0.66, 0.25, 0.2, 0.5]. Thus, by using both nucleotide chemical properties and the lingering density (cf. Equation 4), each nucleotide can be defined by four variables. Accordingly, the RNA sequence of Equation 1 can be defined by a vector with 41 × 4 = 164 components; namely Z = 164 for Equation 3 now.
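The encoding described above can be reproduced with a short sketch, shown here on the “GUGCAG” example from the text. Two points are assumptions rather than statements from the paper: the density at each site is taken over the prefix ending at that site (which is what the quoted example values imply), and the four variables of each nucleotide are stored next to each other in the final vector (the text does not specify the ordering). All function names are hypothetical.

```python
from typing import List

# Chemical-property codes quoted in the text for A, C, G, and U.
CHEMICAL_CODE = {
    "A": (1, 1, 1),
    "C": (0, 0, 1),
    "G": (1, 0, 0),
    "U": (0, 1, 0),
}


def chemical_property_vector(seq: str) -> List[int]:
    """Concatenate the three chemical-property bits of every nucleotide."""
    return [bit for nt in seq for bit in CHEMICAL_CODE[nt]]


def density_vector(seq: str) -> List[float]:
    """Density of the nucleotide at site i within the first i nucleotides."""
    counts = {nt: 0 for nt in CHEMICAL_CODE}
    densities = []
    for i, nt in enumerate(seq, start=1):
        counts[nt] += 1
        densities.append(counts[nt] / i)
    return densities


def encode(seq: str) -> List[float]:
    """Four variables per nucleotide: three chemical bits plus the density."""
    features: List[float] = []
    for nt, density in zip(seq, density_vector(seq)):
        features.extend(CHEMICAL_CODE[nt])  # assumed ordering: bits first
        features.append(density)
    return features


example = "GUGCAG"
print(chemical_property_vector(example))
# [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
print([round(d, 2) for d in density_vector(example)])
# [1.0, 0.5, 0.67, 0.25, 0.2, 0.5]
print(len(encode("A" * 41)))  # 164
```

The third density is 2/3, which rounds to 0.67; the value 0.66 quoted in the text corresponds to truncating the same number.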

Operative Machine

In this study, the SVM was chosen as the operative machine. The SVM has been widely used in computational genomics and proteomics (see, e.g., Ehsan et al., Feng et al.,20, 27, 67, 68, 69 Chen et al.,70, 71, 72 Lin et al., Lai et al., Zhao et al., and Yang et al.). The SVM was implemented by using the LibSVM package 3.18 available at https://www.csie.ntu.edu.tw/~cjlin/libsvm/. The radial basis function (RBF) kernel was used to obtain the classification hyperplane, and the grid search method was applied to optimize the regularization parameter C and kernel parameter γ. The predictor obtained via the above procedures is called “iRNA-3typeA,” where “i” stands for “identify,” and “3typeA” means RNA’s “three types of modifications at adenosine sites.” Illustrated in Figure 3 is a flowchart showing how the iRNA-3typeA predictor works.
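As a rough illustration of the two ingredients named above, the sketch below writes out the RBF kernel, K(x, y) = exp(−γ‖x − y‖²), and enumerates a power-of-two grid of (C, γ) candidates. The search ranges are hypothetical: the paper does not state which ranges were searched, and the ones used here are simply a customary choice. The sketch does not call LibSVM; in a real grid search, each candidate pair would be scored by cross-validation.

```python
import math
from itertools import product
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical search ranges, expressed as exponents of 2.
LOG2_C = range(-5, 16, 2)      # C = 2^-5, 2^-3, ..., 2^15
LOG2_GAMMA = range(-15, 4, 2)  # gamma = 2^-15, 2^-13, ..., 2^3

grid = [(2.0 ** c, 2.0 ** g) for c, g in product(LOG2_C, LOG2_GAMMA)]
print(len(grid))  # 110 candidate (C, gamma) pairs

# The gamma values reported under Table 1 are (rounded) powers of two.
for reported in (0.0078125, 3.05158e-5, 0.00012207, 0.000488281):
    print(reported, "is about 2 **", round(math.log2(reported)))
```

The γ values listed with Table 1 correspond to 2⁻⁷, 2⁻¹⁵, 2⁻¹³, and 2⁻¹¹, which is consistent with a search over powers of two, although the exact grid used by the authors remains an assumption here.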

Figure 3

A Flowchart to Show How the iRNA-3typeA Predictor Is Working

Cross-Validation

To evaluate the quality of a new predictor, we need to consider the following two problems. What metrics should be used to quantitatively display its performance? And what concrete procedure should be followed to derive the metrics’ values?

A set of four metrics. In the literature, the following four conventional metrics are generally used to evaluate a predictor’s quality: (1) Acc, (2) MCC, (3) sensitivity (Sn), and (4) specificity (Sp). But the conventional expressions copied directly from math books lack intuitiveness and are hard to understand for most biological scientists. Fortunately, by using the symbols introduced by Chou in studying signal peptides, the four metrics can be converted to a set of intuitive ones58, 79 as given below:

$$
\begin{cases}
\mathrm{Sn} = 1 - \dfrac{N^{+}_{-}}{N^{+}}, & 0 \le \mathrm{Sn} \le 1 \\
\mathrm{Sp} = 1 - \dfrac{N^{-}_{+}}{N^{-}}, & 0 \le \mathrm{Sp} \le 1 \\
\mathrm{Acc} = 1 - \dfrac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}, & 0 \le \mathrm{Acc} \le 1 \\
\mathrm{MCC} = \dfrac{1 - \left(\dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}}\right)}{\sqrt{\left(1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)\left(1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}}, & -1 \le \mathrm{MCC} \le 1
\end{cases} \tag{6}
$$

where $N^{+}$ represents the total number of positive samples investigated, while $N^{+}_{-}$ is the number of positive samples incorrectly predicted to be negative, and $N^{-}$ represents the total number of negative samples investigated, while $N^{-}_{+}$ is the number of negative samples incorrectly predicted to be positive. With the set of formulations in Equation 6, the meanings of Sn, Sp, Acc, and MCC have become much more intuitive and easier to understand, as discussed in a series of recent studies in various biological areas (see, e.g., Liu et al.,21, 24, 28, 60 Ehsan et al., Feng et al.,20, 27 Song et al., Lin et al., and Xu et al.80, 81).

Jackknife test. Now the next problem is how to derive the values of these metrics in an objective way. As is well known, the independent dataset test, subsampling (or K-fold cross-validation) test, and jackknife test are the three cross-validation methods widely used for testing a prediction method. Of the three, however, the jackknife test is deemed the least arbitrary and most objective one. Accordingly, the jackknife test has been widely recognized and increasingly adopted by investigators to examine the quality of various predictors (see, e.g., Ahmad et al.,52, 83 Lin et al., Tang et al., Tripathi and Pandey, and Dao et al.). In view of this, the jackknife test was also adopted in the current study to examine the proposed predictor. During the jackknife test, each sample in the benchmark dataset is in turn singled out as an independent test sample, and all the rule parameters are calculated without including the one being identified. One more advantage of using the jackknife test is that there is no need to artificially separate the benchmark dataset into two subsets, one for training the model and one for testing it. This is because the outcome obtained by the jackknife test is actually a combination of many different independent dataset tests.88, 89, 90
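The two ideas in this section can be sketched together: the four metrics computed from the counts N⁺, N⁺₋, N⁻, and N⁻₊, and a jackknife (leave-one-out) loop that produces those counts. The sketch assumes that Equation 6 takes the customary form of Sn, Sp, Acc, and MCC written with Chou's symbols. The nearest-neighbour rule is only a toy stand-in for the SVM, and the two-dimensional data points are made up for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]


def chou_metrics(n_pos: int, n_pos_missed: int,
                 n_neg: int, n_neg_missed: int) -> Tuple[float, float, float, float]:
    """Return (Sn, Sp, Acc, MCC) from N+, N+_-, N-, and N-_+."""
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_missed / n_neg
    acc = 1 - (n_pos_missed + n_neg_missed) / (n_pos + n_neg)
    numerator = 1 - (n_pos_missed / n_pos + n_neg_missed / n_neg)
    denominator = math.sqrt(
        (1 + (n_neg_missed - n_pos_missed) / n_pos)
        * (1 + (n_pos_missed - n_neg_missed) / n_neg)
    )
    return sn, sp, acc, numerator / denominator


def nearest_neighbour(train_x: Sequence[Vector], train_y: Sequence[int],
                      query: Vector) -> int:
    """Toy stand-in for the SVM: label of the closest training vector."""
    def sq_dist(v: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(v, query))
    best = min(range(len(train_x)), key=lambda k: sq_dist(train_x[k]))
    return train_y[best]


def jackknife(samples: List[Vector], labels: List[int],
              fit_predict: Callable[[Sequence[Vector], Sequence[int], Vector], int]
              ) -> List[int]:
    """Predict each sample with a model built from all the other samples."""
    predictions = []
    for i, query in enumerate(samples):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(fit_predict(train_x, train_y, query))
    return predictions


# Made-up data: four positives and four negatives, one outlier in each class.
samples = [(1.0, 1.0), (1.0, 1.2), (1.2, 1.0), (0.0, 0.3),
           (0.0, 0.0), (0.2, 0.0), (0.0, -0.2), (1.3, 1.3)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

predictions = jackknife(samples, labels, nearest_neighbour)
n_pos_missed = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
n_neg_missed = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
print(predictions)  # [1, 1, 1, 0, 0, 0, 0, 1]
print(chou_metrics(labels.count(1), n_pos_missed, labels.count(0), n_neg_missed))
# (0.75, 0.75, 0.75, 0.5)
```

With one positive and one negative misclassified out of four each, the result agrees with the conventional confusion-matrix form of MCC, (3·3 − 1·1)/√(4·4·4·4) = 0.5. Note that `chou_metrics` would divide by zero if every sample were predicted into the same class; a production version would need to handle that case.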

Author Contributions

W.C. and H.L. designed the study; P.F., H.Y., and H.D. conducted the experiments; W.C., H.L., and K.-C.C. analyzed the results; W.C., H.L., and K.-C.C. wrote the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References (85 in total)

1. Perspectives in Medicinal Chemistry.

Authors: Guo-Ping Zhou; Wei-Zhu Zhong
Journal: Curr Top Med Chem Date: 2016 Impact factor: 3.295

2. Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses.

Authors: Maryam Esmaeili; Hassan Mohabatkar; Sasan Mohsenzadeh
Journal: J Theor Biol Date: 2009-12-02 Impact factor: 2.691

3. A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins.

Authors: K C Chou
Journal: J Biol Chem Date: 1993-08-15 Impact factor: 5.157

4. High-resolution N(6) -methyladenosine (m(6) A) map using photo-crosslinking-assisted m(6) A sequencing.

Authors: Kai Chen; Zhike Lu; Xiao Wang; Ye Fu; Guan-Zheng Luo; Nian Liu; Dali Han; Dan Dominissini; Qing Dai; Tao Pan; Chuan He
Journal: Angew Chem Int Ed Engl Date: 2014-12-09 Impact factor: 15.336

5. Using Chou's general PseAAC to analyze the evolutionary relationship of receptor associated proteins (RAP) with various folding patterns of protein domains.

Authors: S Muthu Krishnan
Journal: J Theor Biol Date: 2018-02-22 Impact factor: 2.691

6. Identification of Bacterial Cell Wall Lyases via Pseudo Amino Acid Composition.

Authors: Xin-Xin Chen; Hua Tang; Wen-Chao Li; Hao Wu; Wei Chen; Hui Ding; Hao Lin
Journal: Biomed Res Int Date: 2016-06-29 Impact factor: 3.411

7. iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences.

Authors: Wei Chen; Pengmian Feng; Hui Yang; Hui Ding; Hao Lin; Kuo-Chen Chou
Journal: Oncotarget Date: 2017-01-17

8. Some remarks on protein attribute prediction and pseudo amino acid composition.

Authors: Kuo-Chen Chou
Journal: J Theor Biol Date: 2010-12-17 Impact factor: 2.691

9. Identification of antioxidants from sequence information using naïve Bayes.

Authors: Peng-Mian Feng; Hao Lin; Wei Chen
Journal: Comput Math Methods Med Date: 2013-08-24 Impact factor: 2.238

10. Prediction of phosphothreonine sites in human proteins by fusing different features.

Authors: Ya-Wei Zhao; Hong-Yan Lai; Hua Tang; Wei Chen; Hao Lin
Journal: Sci Rep Date: 2016-10-04 Impact factor: 4.379
